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ABSTRACT 



Lahti, L. (2010): Probabilistic analysis of the human transcriptome with 
side information Doctoral thesis, Aalto University School of Science and Tech- 
nology, Dissertations in Information and Computer Science, TKK-ICS-D19, Espoo, 
Finland. 

Keywords: data integration, exploratory data analysis, functional genomics, 
probabilistic modeling, transcriptomics 

Recent advances in high-throughput measurement technologies and efficient 
sharing of biomedical data through community databases have made it possible to 
investigate the complete collection of genetic material, the genome, which encodes 
the heritable genetic program of an organism. This has opened up new views to 
the study of living organisms with a profound impact on biological research. 

Functional genomics is a subdiscipline of molecular biology that investigates the 
functional organization of genetic information. This thesis develops computational 
strategies to investigate a key functional layer of the genome, the transcriptome. 
The time- and context-specific transcriptional activity of the genes regulates the 
function of living cells through protein synthesis. Efficient computational tech- 
niques are needed in order to extract useful information from high-dimensional 
genomic observations that are associated with high levels of complex variation. 
Statistical learning and probabilistic models provide the theoretical framework 
for combining statistical evidence across multiple observations and the wealth of 
background information in genomic data repositories. 

This thesis addresses three key challenges in transcriptome analysis. First, 
new preprocessing techniques that utilize side information in genomic sequence 
databases and microarray collections are developed to improve the accuracy of 
high-throughput microarray measurements. Second, a novel exploratory approach 
is proposed in order to construct a global view of cell-biological network acti- 
vation patterns and functional relatedness between tissues across normal human 
body. Information in genomic interaction databases is used to derive constraints 
that help to focus the modeling in those parts of the data that are supported 
by known or potential interactions between the genes, and to scale up the analy- 
sis. The third contribution is to develop novel approaches to model dependency 
between co-occurring measurement sources. The methods are used to study can- 
cer mechanisms and transcriptome evolution; integrative analysis of the human 
transcriptome and other layers of genomic information allows the identification of 
functional mechanisms and interactions that could not be detected based on the 
individual measurement sources. Open source implementations of the key method- 
ological contributions have been released to facilitate their further adoption by the 
research community. 
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Lahti, L. (2010): Ihmisen geenien ilmentymisen ja taustatiedon tilcistoUi- 
nen mallitus Vaitoskirja, Aalto-yliopiston tcknillincn korkcakoulu, Dissertations 
in Information and Computer Science, TKK-ICS-D19, Espoo, Suomi. 

Avainsanat: aineistojen yhdistely, data-analyysi, toiminnallinen genomiikka, ti- 
lastoUinen mallitus, geenien ilmentyminen 

Mittausmenetelmien kehitys ja tutkimustiedon laajentunut saatavuus ovat mah- 
doUistaneet ihmisen periman eli genomin kokonaisvaltaisen tarkastclun. Tama on 
avannut uusia nakokulmia biologiseen tutkimukseen ja auttanut ymmartamaan 
elaman syntya ja rakennetta uusin tavoin. Toiminnallinen genomiikka on molekyyli- 
biologian osa-aluc, joka tutkii periman toiminnallisia ominaisuuksia. Periman 
toimintaan liittyvaa mittausaineistoa on runsaasti saatavilla, mutta korkeaulot- 
teisiin mittauksiin liittyy monimutkaisia ja tuntemattomia taustatekijoita, joiden 
huomiointi mallitukscssa on haastccllista. Tchokkaat laskcnnallisct mcnctclmat 
ovat avainasemassa pyrittaessa jalostamaan uusista havainnoista kayttokelpoista 
tietoa. 

Tassa vaitoskirjassa on kehitetty ylciskayttoisia laskcnnallisia menetelmia, joilla 
voidaan tutkia ihmisen geenien ilmentymista koko periman tasoha. Geenien il- 
mentyminen viittaa lahetti-RNA-molekyylien tuottoon solussa periman sisaltaman 
informaation nojaUa. Tama on kcskcincn pcrinnoUiscn informaation saatelytaso, 
jonka avuUa solu saatelee proteiinien tuottoa ja solun toimintaa ajasta ja tilanteesta 
riippuen. Tilastollinen oppiminen ja todennakoisyyksin perustuva probabilistinen 
mahitus tarjoavat teorccttiscn kchykscn, jonka avuUa rinnakkaisiin mittaiiksiin 
ja taustatietoihin sisaltyvaa informaatiota voidaan kayttaa kasvattamaan mallien 
tilastollista voimaa. Kehitetyt menetelmat ovat ylciskayttoisia laskennallisen tie- 
tccn tutkimusvahncita, jotka tckcvat vahan, mutta selkcasti ilmaistuja malhtusole- 
tuksia ja sietavat korkeaulotteisiin toiminnahisen genomiikan havaintoaineistoihin 
sisaltyvia epavarmuuksia. 

Vaitoskirjassa kehitetyt mc^ncitelmiit tarjoavat ratkaisuja kohncx^n kcskcisccn 
malhtusongelmaan toiminnalhsessa genomiikassa. Luotettavien esikasittelymene- 
telmien kehittaminen on tyon ensimmainen paatulos, jossa tietokantoihin sisalty- 
vaa taustatietoa kaytetaan perimanlaajuisten mittausaincistojcn epavarmuuksien 
vahentamiseksi. Toisena paatuloksena vaitoskirjassa kehitetaan uusi ahavaruus- 
kasautukseen perustuva menetelma, jonka avulla voidaan tutkia ja kuvata solu- 
biologisen vuorovaikutusvcrkon kayttaytymista kokonaisvahaiscsti ihmiskchon cri 
osissa. Taustatietoa geenien vuorovaikutuksista kaytetaan ohjaamaan ja nopeut- 
tamaan mallitusta. Menetelmalla saadaan uutta tietoa geenien saatelysta ja ku- 
dostcn toiminnalhsista yhtcyksista. Kolmanncksi vaitoskirjatyossa kehitetaan uu- 
sia menetelmia perimanlaajuisten mittausaincistojcn yhdistclyyn. Ihmisen geenien 
ilmentymisen ja muiden aineistojen riippuvuuksicn mallitus mahdoUistaa sellaistcn 
toiminnalhsten yhteyksicn ja vuorovaikutustcn havaitscmiscn, joiden tutkimiseksi 
yksittaiset havaintoaineistot ovat riittamattomia. Aineistojen yhdistelyyn kehitet- 
tyja menetelmia sovelletaan syopamekanismien ja lajien valisten eroavaisuuksien 
tutkimiscen. Julkaistuilla avoimen lahdekoodin toteutuksilla on pyritty varmista- 
maan kehitettyjen menetelmien saatavuus ja laajempi kayttoonotto laskennaUisen 
biologian tutkimuksessa. 
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Chapter 1 

Introduction 



Revolutions in measurement technologies have led to revolutions in science and 
society. Introduction of the microscope in the 17th century opened a new view to 
the world of living organisms and enabled the study of life processes at cellular 
level. Since then, new techniques have been developed to investigate ever smaller 
objects. The discovery of the molecular structure of the DNA in 1953 (Watson 
and Crick, 1953) led to the establishment of genes as fundamental imits of ge- 
netic information that is passed on between generations. The draft sequence of 
the human genome, covering three billion DNA base pairs, was published in 2001 
(International human genome sequencing consortium, 2001; Venter et al., 2001). 
Modern measurement technologies provide researchers with large volumes of data 
concerning the structure, function, and interactions of genes and their products. 
Rapid accumulation of genomic data in shared community databases has accel- 
erated biological research (Cochrane and Galperin, 2010), but the structural and 
functional organization of genetic information is still poorly understood. While 
functional roles of individual genes have been characterized, little is known re- 
garding the higher-level regularities and interactions from which the complexity 
and diversity of life emerges. The quest for systems-level understanding of genome 
function is a major paradigm in modern biology (Collins et al., 2003). 

Computational science has a key role in transforming the genomic data collec- 
tions into new biological knowledge (Cohen, 2004). New observations allow the 
formulation of new research questions, but also bring new challenges (Barbour 
et al., 2005). The sheer size of high-throughput data sets makes them incompre- 
hensible for human mind, and the complexity of biological phenomena and high 
levels of uncontrolled variation set specific challenges for computational analysis 
(Tilstone, 2003; Troyanskaya, 2005). Filtering relevant information from statisti- 
cally uncertain high-dimensional data is a challenging task where new computa- 
tional methods arc needed to organize and summarize the overwhelming volumes 
of observational data into a comprehensible form to make new discoveries about 
the structure of life; computation is a new microscope for studying massive data 
sets. 

This thesis develops principled exploratory methods to investigate the human 
transcriptome. It is a central functional layer of the genome and a significant 
source of phcnotypic variation. The transcriptome refers to the complete collec- 
tion of messenger-RNA transcripts of an organism. The essentially static genome 
sequence regulates the time- and context-specific patterns of transcriptional ac- 
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tivity of the genes, and subsequently the function of hving cells through protein 
synthesis. An average cell contains over 300,000 mRNA molecules and the expres- 
sion levels of individual genes span 4-5 orders of magnitude (Carninci, 2009). A 
wealth of associated genomic information resources are available in public reposi- 
tories (Cochrane and Galperin, 2010). By combining heterogeneous information 
sources and utilizing the wealth of background information in public repositories, 
it is possible to solve some of the problems that are related to the statistical uncer- 
tainties and small sample size of individual data sets, as well as to form a holistic 
picture of the genome (Huttenhower and Hofmann, 2010). 

The observational data can provide the starting point to discover novel research 
hypotheses of poorly characterized large-scale systems; the analysis proceeds from 
general observations of the data toward more detailed investigations and hypothe- 
ses. This differs from traditional hypothesis testing where the investigation pro- 
ceeds from hypotheses to measurements that target particular research questions, 
in order to support or reject a given hypothesis. Exploratory data analysis refers 
to the use of computational tools to summarize and visualize the data in order to 
identify potentially interesting structure, and to facilitate the generation of new 
research hypotheses when the search space would be otherwise exhaustively large 
(Tukey, 1977). When the system is poorly characterized, there is a need for meth- 
ods that can adapt to the data and extract features in an automated way. This 
is useful since application-oriented models often require careful preprocessing of 
the data and a timely model fitting process. They may also require prior know- 
ledge of the investigated system, which is often not available. Statistical learning 
investigates solutions to these problems. 

1.1 Contributions and organization of the thesis 

This thesis introduces computational strategies for genome- and organism-wide 

analysis of the human transcriptome. The thesis provides novel tools (i) to in- 
crease the reliability of high-throughput microarray measurements by combining 
statistical evidence from genome sequence databases and across multiple microar- 
ray experiments, (ii) to model context-specific transcriptional activation patterns 
of genome-scale interaction networks across normal human body by using back- 
ground information of genetic interactions to guide the analysis, and (iii) to inte- 
grate measurements of the human transcriptome to other layers of genomic infor- 
mation with novel dependency modeling techniques for co-occurring data sources. 
The three strategies address widely recognized challenges in functional genomics 
(CoUins et al., 2003; Troyanskaya, 2005). 

Obtaining reliable measurements is the crucial starting point for any data anal- 
ysis task. The first contribution of this thesis is to develop computational strategies 
that utilize side information in genomic sequence and microarray data collections 
in order to reduce noise and improve the quality of high-throughput observations. 
Publication 1 introduces a probe-level strategy for microarray preprocessing, where 
updated genomic sequence databases are used in order to remove erroneously tar- 
geted probes to reduce measurement noise. The work is extended in Publication 2, 
which introduces a principled probabilistic framework for probe-level analysis. A 
generative model for probe-level observations combines evidence across multiple 
experiments, and allows the estimation of probe performance directly from mi- 
croarray measurements. The model detects a large number of unreliable probes 
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contaminated by known probe-level error sources, as well as many poorly per- 
forming probes where the source of contamination is unknown and could not be 
controlled based on existing probe-level information. The model provides a prin- 
cipled framework to incorporate prior information of probe performance. The 
introduced algorithms outperform widely used alternatives in differential gene ex- 
pression studies. 

A novel strategy for organism- wide analysis of transcriptional activity in genome- 
scale interaction networks in Publication 3 forms the second main contribution of 
this thesis. The method searches for local regions in a network exhibiting coordi- 
nated transcriptional response in a subset of conditions. Constraints derived from 
genomic interaction databases are used to focus the modeling on those parts of 
the data that arc supported by known or potential interactions between the genes. 
Nonparametric inference is used to detect a number of physiologically coherent 
and reproducible transcriptional responses, as well as context-specific regulation 
of the genes. The findings provide a global view on transcriptional activity in 
cell- biological networks and functional relatedness between tissues. 

The third contribution of the thesis is to integrate measurements of the human 
transcriptome to other layers of genomic information. Novel dependency modeling 
techniques for co-occurrence data are used to reveal regularities and interactions, 
which could not be detected in individual observations. The regularized depen- 
dency modeling framework of Publication 4 is used to detect associations between 
chromosomal mutations and transcriptional activity. Prior biological knowledge 
is used to constrain the latent variable model and shown to improve cancer gene 
detection performance. The associative clustering, introduced in Publications 5 
and 6, provides tools to investigate evolutionary divergence of transcriptional ac- 
tivity. 

Open source implementations of the key methodological contributions of this 
thesis have been released in order to guarantee wide access to the developed al- 
gorithmic tools and to comply with the emerging standards of transparency and 
reproducibility in computational science, where an increasing proportion of re- 
search details are embedded in code and data accompanying traditional publi- 
cations (Boulesteix, 2010; Carey and Stodden, 2010; loannidis et al., 2009) and 
transparent sharing of these resources can form valuable contributions to public 
knowledge (Sommer, 2010; Sonnenburg et al., 2007; Stodden, 2010). 

The thesis is organized as follows: In Chapter 2, there is an overview of func- 
tional genomics, related measurement techniques, and genomic data resources. 
General methodological background, in particular of exploratory data analysis and 
the probabilistic modeling paradigm, is provided in Chapter 3. The methodological 
contributions of the thesis are presented in Chapters 4-6. In Chapter 4, strategics 
to improve the reliability of high-throughput microarray measurements are pre- 
sented. In Chapter 5 methods for organism-wide analysis of the transcriptome are 
considered. In Chapter 6, two general-purpose algorithms for dependency model- 
ing are introduced and applied in investigating functional effects of chromosomal 
mutations and evolutionary divergence of transcriptional activity. The conclusions 
of the thesis are summarized in Chapter 7. 
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Chapter 2 

Functional genomics 



From all we have learnt about the structure of living matter, we must be 
prepared to find it working in a manner that cannot be reduced to the ordi- 
nary laws of physics - - because the construction is different from anything 
we have yet tested in the physical laboratory. 

E. Schrodinger (1956) 

Living organisms arc controlled not only by natural laws but also by inheritable 
genetic programs (Mayr, 2004; Schrodinger, 1944). Such double causation is a 
unique feature of life, and in fundamental contrast to purely physical processes 
of the inanimate world. Life may have emerged on earth more than 3.4 billion 
years ago (Schopf, 2006; Tice and Lowe, 2004). Genetic information evolves by 
means of natural selection (Darwin, 1859). Living organisms maintain homeostasis, 
adapt to changing environments, respond to external stimuli, and communicate. 
Peculiar features of living systems include metabolism, growth and hierarchical 
organization, as well as the ability to replicate and reproduce. All known life 
forms share fundamental mechanisms at molecular level, which suggests a common 
evolutionary origin of the living organisms. 

The complete collection of genetic material, the genome, encodes the herita- 
ble genetic program of an organism. Advances in measurement technology and 
computational science have opened up new views to the large-scale organization of 
the genome (Carroll, 2003; Lander, 1996). Functional genomics is a subdiscipline 
of molecular biology investigating the functional organization and properties of 
genetic information. In this thesis, new computational approaches are developed 
for investigation of a central functional layer of the genome of our own species, 
the human transcriptome. This chapter gives an overview to the relevant concepts 
in genome biology in eukaryotic organisms and associated genomic data resources. 
For further background in molecular genome biology, see Alberts et al. (2002); 
Brown (2006). 

2.1 Universal genetic code 

Cells are fundamental building blocks of living organisms. All known life forms 
maintain a carbon-based cellular form that carries the genetic program (Alberts 
et al., 2002). Each cell carries a copy of the heritable genetic code, the genome. 
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The human genome is divided in 23 pairs of chromosomes, located in the nu- 
cleus of the cell, as well as in additional mitochondrial genome. Chromosomes are 
macroscopic deoxyribonucleic acid (DNA) molecules in which the DNA is wrapped 
around histone molecules and packed into a peculiar chromatin structure that will 
ultimately constitute chromosomes. The genetic code in the DNA consists of four 
nucleotides: adenosine (A), thymine (T), guanine (G), and cytosine (C). In ribonu- 
cleic acid (RNA), the thymine is replaced by uracil (U). Ordering of the nucleotides 
carries genetic information. Nucleic acid sequences have a peculiar base pairing 
property, where only A-T/U and G-C pairs can hybridize with each other. This 
leads to the well-known double-stranded structure of the DNA, and forms the basis 
for cellular information processing. The central dogm.a of molecular biology (Crick, 
1970) states that DNA encodes the information to construct proteins through the 
irreversible process of protein synthesis. This is a central paradigm in molecular 
biology, describing the functional organization of life at the cellular level. 

2.1.1 Protein synthesis 

Genes are basic units of genetic information. The gene is a sequence of DNA that 
contains the information to manufacture a protein or a set of related proteins. 
Genetic variation and rcgiilation of gene activity has therefore major phenotypic 
consequences. The regulatory region and coding sequence are two key elements of 
a gene. The regulatory region regulates gene activity, while the coding sequence 
carries the instructions for protein synthesis (Alberts et al., 2002). Interestingly, 
the concept of a gene remains controversial despite comprehensive identification 
of the protein-coding genes in the human genome and detailed knowledge of their 
structure and fimction (Pearson, 2006). 

Proteins, encoded by the genes, are key functional entities in the cell. They 
form cellular structures, and participate in cell signaling and functional regulation. 
Protein synthesis refers to the cell-biological process that converts genetic informa- 
tion into final functional protein products (Figure ??A). Key steps in protein syn- 
thesis include transcription, pre-mRNA splicing, and translation. In transcription, 
the double-stranded DNA is opened in a proximity of the gene scqiicnce and the 
process is initiated on the regulatory region of the gene. The DNA sequence of the 
gene is then converted into a complementary pre-mRNA by a polymerase enzyme. 
The pre-mRNA sequence contains both protein coding and non-coding segments. 
These are called exons and introns, respectively. In pre-mRNA splicing, the introns 
are removed and the exons are joined together to form mature messenger-RNA 
(mRNA). A gene can encode multiple splice variants, corresponding to different 
exon definitions and their combinations; this is called alternative splicing. The 
mature mRNA is exported from nucleus to the cell cytoplasm. In translation the 
mRNA is converted into a corresponding amino acid sequence in ribosomcs based 
on the universal genetic code that defines a mapping between nucleic acid triplets, 
so-called codons, and amino acids. The code is common for all known life forms. 
Each consecutive codon on the mRNA sequence corresponds to an amino acid, 
and the corresponding sequence of amino acids constitutes a protein. In the final 
stage of protein synthesis, the amino acid sequence folds into a three-dimensional 
structure and undergoes post-translational modifications. The structural charac- 
teristics of a protein molecule will ultimately determine its functional properties 
(Alberts et al., 2002). 
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Figure 2.1: A Key steps of protein synthesis. The two key processes in protein synthesis are 
called transcription and translation, respectively. In transcription, the DNA sequence of the gene 
is transcribed into pre-mRNA based on the base pairing property of nucleic acid sequences. The 
pre-mRNA is modified to produce mature messenger- RNA (mRNA), which is then transported 
to cytoplasm. Transfer-RNA (tRNA) carries the mRNA to ribosomes, where it is translated 
into an amino acid sequence based on the universal genetic code where each nucleotide triplet 
of the mRNA sequence, so-called codon, corresponds to a particular amino acid. The amino 
acid sequence is subsequently modified to form the final functional protein product. B Or- 
ganization of the genetic material in an eukaryotic cell. The nucleotide base pairs form the 
double helix structure of DNA. This is wrapped around histone molecules to form nucleosomes, 
and the chromatin sequence. The chromatin is tightly packed to form chromosomes that carry 
the genetic material and are located in the cell nucleus. The image has been modified from 
http: / /commons. wikimedia.org/wiki/File:Chromosome_en.svg. 



2.1.2 Layers of regulation 

Phenotypic changes can rarely be attributed to changes in individual genes; cell 
function is ultimately determined by coordinated activation of genes and other bio- 
molecular entities in response to changes in cell-biological environment (Hartwell 
et al., 1999). Gene activity is regulated at all levels of protein synthesis and cellu- 
lar processes. A major portion of functional genome sequence and protein coding- 
genes themselves participate in the regulatory system itself (Lauffenburger, 2000). 

Epigenetic regulation refers to chemical and structural modifications of chro- 
mosomal DNA, the chromatin, for instance through methylation, acetylation, and 
other histone-binding molecules. Such modifications affect the packing of the DNA 
molecule around histones in the cell nucleus. The combinatorial regulation of such 
modifications regulates access to the gene sequences (Gibney and Nolan, 2010). 
Epigenetic changes are believed to be heritable and they constitute a major source 
of variation at individual and population level (Johnson and Tricker, 2010). Tran- 
scriptional regulation is the next major regulatory layer in protein synthesis. So- 
called transcription factor proteins can regulate the transcription rate by binding 
to control elements in gene regulatory region in a combinatorial fashion. Post- 
trans criptional modifications will then regulate pre-mRNA splicing. Up to 95% 
of human multi-exon genes are estimated to have alternative splice variants (Pan 
et al., 2008). Consequently, a variety of related proteins can be encoded by a single 
gene. This contributes to the structural and functional diversity of cell function 
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(Stetefeld and Ruegg, 2005). Several mechanisms will then affect mRNA degra- 
dation rates. For instance, micro-RNAs that are small, 21-25 basepair nucleotide 
sequences can inactivate specific mRNA transcripts through complementary base 
pairing, leading to mRNA degradation, or prevention of translation. Finally, post- 
translational modifications, protein degradation, and other mechanisms will affect 
the three-dimensional structure and life cycle of a protein. The proteins will par- 
ticipate in further cell-biological processes. The processes are in continuous inter- 
action and form complex functional networks, which regulate the life processes of 
an organism (Alberts et al., 2002). 

2.2 Organization of genetic information 

The understanding of the structure and functional organization of the genome 

is rapidly accumulating with the developing genome-scanning technologies and 
computational methods. This section provides an overview to key structural and 
functional layers of the human genome. 

2.2.1 Genome structure 

The genome is a dynamic structure, organized and regulated at multiple levels of 
resolution from individual nucleotide base pairs to complete chromosomes (Fig- 
ure ??B; Brown (2006)). A major portion of heritable variation between individu- 
als has been attributed to differences in the genomic DNA sequence. Traditionally, 
main genetic variation was believed to arise from small point mutations, so-called 
single-nucleotide polymorphisms (SNPs), in protein-coding DNA. Recently, it has 
been increasingly recognized that structural variation of the genome makes a re- 
markable contribution to genetic variation. Structural variation is observed at all 
levels of organization from single-nucleotide polymorphisms to large chromosomal 
rearrangements, including deletions, insertions, duplications, copy-number vari- 
ants, inversions and translocations of genomic regions (Feuk et al., 2006; Sharp 
et al., 2006). Such modifications can directly and indirectly influence transcript- 
ional activity and contribute to human diversity and health (Collins et al., 2003; 
Hurles et al, 2008). 

The draft DNA sequence of the complete human genome was published in 2001 
(International human genome sequencing consortium, 2001; Venter et al., 2001). 
The human genome contains three billion base pairs and approximately 20,000- 
25,000 protein-coding genes (International Human Genome Sequencing Consor- 
tium, 2004). The protein-coding exons comprise less than 1.5% of the human 
genome sequence. Approximately 5% of the human genome sequence has been 
conserved in evolution for more than 200 million years, including the majority of 
protein-coding genes (The ENCODE Project Consortium, 2007; Mouse Genome 
Sequencing Consortium, 2002). Half of the genome consists of highly repetitive 
sequences. The genome sequence contains structural elements such as centromeres 
and telomeres, repetitive and mobile elements, (Prak and Kazazian Jr., 2000), 
retroelements (Bannert and Kurth, 2004), and non-coding, non- repetitive DNA 
(Collins et al., 2003). The functional role of intergenic DNA, which forms 75% of 
the genome, is to a large extent unknown (Venter et al., 2001). Recent evidence 
suggests that the three-dimensional organization of the chromosomes, which is to 
a large extent regulated by the intergenic DNA is under active selection, can have 
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a remarkable regulatory role (Lieberman-Aiden et al., 2009; Parker et al., 2009). 
Comparison of the human genome with other organisms, such as the mouse (Mouse 
Genome Sequencing Consortium, 2002) can highlight important evolutionary dif- 
ferences between species. For a comprehensive review of the structural properties 
of the human genome, see Brown (2006). 

2.2.2 Genome function 

In protein synthesis, the gene sequence is transcribed into prc-mRNA, which is 
then further modified into mature messenger-RNA and transported to cytoplasm. 
An average cell contains over 300,000 mRNA molecules, and the mRNA concen- 
tration, or expression levels of individual genes, vary according to Zipf's law, a 
power-law distribution where most genes are expressed at low concentrations, per- 
haps only one or few copies of the mRNA per cell on average, and a small number 
of genes arc highly expressed, potentially with thousands of copies per cell (sec 
Carninci, 2009; Furusawa and Kaneko, 2003). Cell- biological processes are re- 
flected at the transcriptional level. Transcriptional activity varies by cell type, 
environmental conditions and time. Different collections of genes are active in 
different contexts. Gene expression, or mRNA expression, refers to the expression 
level of an mRNA transcript at particular physiological condition and time point. 
In addition to protein-coding mRNA molecules that are the main target of analy- 
sis in this thesis, the cell contains a variety of other functional and non-functional 
mRNA transcripts, for instance micro-RNAs, ribosomal RNA and transfer-RNA 
molecules (Carninci, 2009; Johnson et al., 2005). 

The transcriptome refers to the complete collection of mRNA sequences of an 
organism. This is a central functional layer of the genome that regulates protein 
production in the cells, with a significant role in creating genetic variation (Jordan 
et al., 2005). According to current estimates, up to 90% of the eukaryotic genome 
can be transcribed (Consortium, 2005; Gagneur et al., 2009). The protein-coding 
mRNA transcripts are translated into proteins at ribosomes during protein syn- 
thesis. 

The proteome refers to the collection of protein products of an organism. The 
protcomc is a main functional layer of the genome. Since the final protein products 
carry out a main portion of the actual cell functions, techniques for monitoring 
the concentrations of all proteins and their modified forms in a cell simultane- 
ously would significantly help to improve the understanding of the cellular systems 
(Collins et al., 2003). However, sensitive, reliable and cost-efficient genome- wide 
screening techniques for measuring protein expression are currently not available. 
Therefore genome-wide measurements of the mRNA expression levels are often 
used as an indirect estimate of protein activity. 

In addition to the DNA, RNA and proteins, the cell contains a variety of other 
small molecules. The extreme functional diversity of living organisms emerges from 
the complex network of interactions between the biomolecular entities (Barabasi 
and Oltvai, 2004; Hartwell et al., 1999). Understanding of these networks and their 
functional properties is crucial in understanding cell function (Collins et al., 2003; 
Schadt, 2009). However, the systemic properties of the interactome are poorly 
characterized and understood due to the complexity of biological phenomena and 
incomplete information concerning the interactions. The cell-biological processes 
are inherently modular (Hartwell et al., 1999; Ihmels et al., 2002; Lauffenburger, 
2000), and they exhibit complex pathway cross-talk between the cell-biological 
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processes (Li ct al., 2008). In modular systems, small changes can have significant 
regulatory effects (Espinosa-Soto and Wagner, 2010). 

2.3 Genomic data resources 

Systematic observations from the various functional and regulatory layers of the 
genome are needed to understand cell-biological systems. Efficient sharing and 
integration of genomic information resources through digital media has enabled 
large-scale investigations that no single institution could afford. The public human 
genome sequencing project (International human genome sequencing consortium, 
2001) is a prime example of such project. Results from genome-wide transcriptio- 
nal profiling studies are routinely deposited to public repositories (Barrett et al., 
2009: Parkinson et al., 2009). Sharing of original data is increasingly accepted as 
the scientific norm, often following explicit data release policies. The establishment 
of large-scale databases and standards for representing biological information sup- 
port the efficient use of these resources (Bammler et al., 2005; Brazma et al., 2006). 
A continuously increasing array of genomic information is available in these data- 
bases, concerning aspects of genomic variability across individuals, disease states, 
and species (Brent, 2008; Church, 2005; Cochrane and Galperin, 2010; GIOKCOS 
consortium, 2009; The Cancer Genome Atlas Research Network, 2008). 

2.3.1 Community databases and evolving biological know- 
ledge 

Genomic sequence databases 

During the human genome project and preceding sequencing projects DNA se- 
quence reads were among the first sources of biological data that were collected 
in large-scale public repositories, such as GenBank (Benson et al., 2010). Gen- 
Bank contains comprehensive sequence information of genomic DNA and RNA 
for a number of organisms, as well as a variety of information concerning the 
genes, non-coding regions, disease associations, variation and other genomic fea- 
tures. Online analysis tools, such as the Ensembl Genome browser (Flicek et al., 
2010), facilitate efficient use of these annotation resources. Next-generation se- 
quencing technologies provide rapidly increasing sequencing capacity to investigate 
sequence variation between individuals, populations and disease states (Ledford, 
2010; McPherson, 2009). In particular, the human and mouse transcriptome se- 
quence collections at the Entrez Nucleotide database of GenBank are utilized in 
this thesis, in Publications 1 and 2. 

Transcriptome databases 

Gene expression measurement provides a snapshot of mRNA transcript levels in a 
cell population at a specific time and condition, reflecting the activation patterns 
of the various cell-biological processes. While gene expression measurements pro- 
vide only an indirect view to cellular processes, their wide availability provides a 
unique resource for investigating gene co-regulation on a genome- and organism- 
wide scale. Versatile collections of microarray data in public repositories, such 
as the Gene Expression Omnibus (GEO; Barrett et al. (2009)) and ArrayExpress 
(Parkinson et al., 2009) are available for human and model organisms, and they 
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contain valuable information of cell function (Consortium, 2005; DeRisi et al., 
1997; Russ and Futschik, 2010; Zhang et al., 2004). 

Several techniques are available for quantitative and highly parallel measure- 
ments of mRNA or gene expression, allowing the measurement of the expression 
levels of tens of thousands of mRNA transcripts simultaneously (Bradford et al., 
2010). Microarray techniques are routinely used to measure the expression levels 
of tens of thousands of mRNA transcripts in a given sample, and transcriptio- 
nal profiling is currently a main high-throughput technique used to investigate 
gene function at genome- and organism- wide scale (Gershon, 2005; Yauk et al., 
2004). Increasing amounts of transcriptional profiling data arc being produced 
by sequencing- based methods (Carninci, 2009). A main difference between the 
microarray- and sequencing-based techniques is that gene expression arrays have 
been designed to measure predefined mRNA transcripts, whereas sequencing-based 
methods do not require prior information of the measured sequences, and enable 
de novo discovery of expressed transcripts (Bradford et al., 2010; 't Hoen et al., 
2008). Large-scale microarray repositories provide currently the most mature tools 
for data processing and retrieval, and form the main source of transcriptome data 
in this thesis. 

Microarray technology is based on the base pairing property of nucleic acid se- 
quences where the DNA or RNA sequences in a sample bind to the complementary 
nucleotide sequences on the array. This is called hybridization. The measurement 
process begins by the collection of cell samples and isolation of the sample mRNA. 
The isolated mRNA is converted to cDNA, labeled with specific marker molecules, 
and hybridized on complementary probe sequences on the array. The array sur- 
face may contain hundreds of thousands of spots, each containing specific probe 
sequences designed to uniquely match with particular mRNA sequences. The hy- 
bridization level refiects the target mRNA concentration in the sample, and it is 
estimated by measuring the intensity of light emitted by the label molecules with 
a laser scanner. Short oligonucleotide arrays (Lockhart et al., 1996) are among 
the most widely used microarray technologies, and they are the main source of 
mRNA expression data in this thesis. Short oligonucleotide arrays utilize multi- 
ple, typically 10-20, probes for each transcript target that bind to different regions 
of the same transcript sequence. Use of several 25-nucleotide probes for each target 
leads to more robust estimates of transcript activity. Each probe is expected to 
uniquely hybridize with its intended target, and the detected hybridization level is 
used as a measure of the activity of the transcript. A short oligonucleotide array 
measures absolute expression levels of the mRNA sequences; relative differences 
between conditions can be investigated afterwards by comparing these measure- 
ments. A standard whole-genome array measures typically ~20, 000-50, 000 unique 
transcript sequences. A single microarray experiment can therefore produce hun- 
dreds of thousands of raw observations. 

Comparison and integration of individual microarray experiments is often chal- 
lenging due to remarkable experimental variation between the experiments. Com- 
mon standards have been developed to advance the comparison and integration 
(Brazma et al., 2001, 2006). Carefully controlled integrative datasets, so-called 
gene expression atlases, contain thousands of genome-wide measurements of tran- 
scriptional activity across diverse conditions in a directly comparable format. Ex- 
amples of such data collections include GeneSapiens (Kilpinen et al., 2008), the 
human gene expression atlas of the European Bioinformatics Institute (Lukk et al.. 
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2010), as well as the NCI-60 cell line panel (Scherf et al., 2000). Integrative anal- 
ysis of large and versatile transcriptome collections can provide a holistic view of 
transcriptional activity of the various cell-biological processes, and opens up possi- 
bilities to discover previously uncharacterized cellular mechanisms that contribute 
to human health and disease. 

Other types of microarray data 

Microarray techniques can also be; used to study other functional aspects of the 
genome, including cpigcnctics and micro-RNA regulation, chromosomal aberra- 
tions and polymorphisms, alternative splicing, as well as transcription factor bind- 
ing (Butte, 2002; Hoheisel, 2006). For instance, chromosomal aberrations can be 
measured with the array comparative genome hybridization method (aCGH; Pinkel 
and Albertson 2005), which is based on hybridization of DNA sequences on the 
array surface. Copy number changes are a particular type of chromosomal aber- 
rations, which are a major mechanism for cancer development and progression. 
Copy number alterations can cause changes in gene- and micro-RNA expression, 
and ultimately cell-biological processes (Beroukhim et al., 2010). A public reposi- 
tory of copy number measurement data is provided for instance by the CanGEM 
database (Scheinin et al., 2008). In Publication 4, microarray measurements of 
DNA copy number changes are integrated with transcriptional profiling data to 
discover potential cancer genes for further biomedical analysis. 

Pathway and interaction databcises 

Curated information concerning cell-biological processes is valuable in both exper- 
imental design and validation of computational studies (Blake, 2004). Represen- 
tation of dynamic biochemical reactions in their full richness is a challenging task 
beyond a mere listing of biochemical events; a variety of proteins and other com- 
pounds interact in a hierarchical manner through various molecular mechanisms 
(Hartwell et al., 1999; Przytycka et al., 2010). Standardized database formats 
such as the BioPAX (BioPAX workgroup, 2005) and SBML (Stromback and Lam- 
brix, 2005) advance the accumulation of highly structured biological knowledge 
and automated analysis of such data. A huge body of information concerning 
ccU-biological processes is available in public repositories. The most widely used 
annotation resources include the Gene Ontology (GO) database (Ashburner et al., 
2000) and the KEGG pathway database (Kanehisa ct al., 2010). The GO database 
provides functional annotations for genes and can be nacd for instance to detect 
enrichment of certain functional categories among the key findings from compu- 
tational analysis, as in Publication 6, where enrichment analysis is used for both 
validation and interpretation piirposcs. Pathways arc more structured represen- 
tations concerning cellular processes and interactions between molecular entities. 
Such prior information can be used to guide computational modeling, as in Pub- 
lication 3, where pathway information derived from the KEGG pathway database 
is used to guide organism-wide discovery and analysis of transcriptional response 
patterns. 

Evolving biological knowledge 

The collective knowledge about genome organization and function is constantly 
updated and refined by improved measurement techniques and accumulation of 
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data (Sebat, 2007). This can alter the analysis and interpretation of results from 
large-scale genomic screens. For instance, evolving gene and transcript defini- 
tions are known to significantly affect microarray interpretation. Probe design 
on microarray technology relies on sequence annotations that may have changed 
significantly after the original array design. Reinterpretation of microarray data 
based on updated probe annotations has been shown to improve the accuracy 
and comparability of microarray results (Dai et al., 2005; Hwang et al., 2004; 
Mecham et al., 2004b). Bioinformatics studies routinely take into account updates 
in genome version, genome build, in new analyses. The constantly refined bio- 
logical data highlights the need to account for this uncertainty in computational 
analyses. In Publications 1 and 2, explicit computational strategies that are robust 
against evolving transcript definitions are developed for microarray data analysis. 

2.3.2 Challenges in high-throughput data analysis 

High-throughput genetic screens are inherently noisy. Controlling all potential 
sources of variation in the measurement process is increasingly difficult when au- 
tomated measurement techniques can produce millions of data points in a single 
experiment, concerning extremely complex living systems that are to a large extent 
poorly understood. 

Noise arises from both technical and biological sources (Butte, 2002), and sys- 
tematic variation between laboratories, measurement batches and measurement 
platforms has to be taken into account when combining the results across individ- 
ual studies (Hcber and Sick, 2006; MAQC Consortium, 2006). Moreover, genomic 
knowledge is constantly evolving, which can potentially change the interpretation 
of previous experiments (see e.g. Dai et al., 2005). The various sources of noise 
and uncertainty in microarray studies are discussed in more detail in Chapter 4. 

High dimensionality of the data and small sample size form another challenge 
for the analysis of high-throughput functional genomics data. Tens of thousands 
of transcripts can be measured simultaneously in a single microarray experiment, 
which greatly exceeds the number of available samples in most biomedical studies. 
Small sample sizes leave considerable uncertainty in the analyses; few observations 
contain very limited information conec;rning the complex and high-dimensional 
phenomena and potential interactions between different parts of the system. Over- 
fitting of the models and the problem of multiple testing forms considerable chal- 
lenges in such situations. While automated analysis methods can generate thou- 
sands of hypotheses concerning the system, prioritizing the findings and charac- 
terizing uncertainty in the predictions become central issues in the analysis. The 
curse of dimensionality, coupled with the high levels of noise in functional genomics 
studies, is therefore posing particular challenges for computational modeling (Saeys 
et al., 2007). 

The challenges in controlling the various soiirces of uncertainty have led to 
remarkable problems in reproducing microarray results (loannidis et al., 2009), 
but maturing technology and the development of common standards and analyti- 
cal procedures are constantly improving the reliability of high-throughput screens 
(Allison et al, 2006; Reimers, 2010; MAQC Consortium, 2006). The models de- 
veloped in this thesis combine statistical evidence across related experiments to 
improve the reliability of the analysis and to increase modeling power. Gener- 
ative probabilistic models provide a rigorous framework for handling noise and 
uncertainty in the data and models. 
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2.4 Genomics and health 

Genomic variation between individuals has remarkable and to a large extent un- 
known contribution to health and disease susceptibility. Large-scale characteriza- 
tion of the variability between individuals and populations is expected to elucidate 
genomic mechanisms associated with disease, as well as to lead to the discovery 
of novel medical treatments. High-throughput genomics can provide new tools 
to understand disease mechanisms (Braga-Neto and Marques, 2006; Lage et al., 
2008), to 'hack the genome' (Evanko, 2006) to treat diseases (Volinia et al., 2010), 
and to guide personalized therapies that take into account the individual variabil- 
ity in sensitivity and responses to treatments (Church, 2005; Downward, 2006; 
Foekens et al., 2008; Ocana and Pandiella, 2010; van 't Veer and Bernards, 2008). 
Disease signatures are potentially robust across tissues and experiments (Dudley 
et al., 2009; Hu et al., 2006). Genomic screens have revealed new disease subtypes 
(Bhattacharjee ct al., 2001), and led to the discovery of various diagnostic (Lee 
et al., 2008; Su et al., 2009; Tibshirani et al., 2002) and prognostic (Beer et al., 
2002) biomarkers. Diseases cause coordinated changes in gene activity through 
biomolecular networks (Cabusora et al., 2005). Integration of chemical, genomic 
and pharmacological functional genomics data can also help to predict new drug 
targets and responses (Lamb et al., 2006; Yamanishi et al., 2010). Genomic mu- 
tations can also affect genome function and cause diseases (Taylor et al., 2008). 
Cancer is an example of a prevalent genomic disease. Boveri (1914) discovered that 
cancer cells have chromosomal imbalances, and since then the understanding of ge- 
nomic changes associated with cancer has continuously improved (Stratton et al., 
2009; Wunderlich, 2007). For instance, many human micro- RNA genes are located 
at cancer-associated genomic regions and are functionally altered in cancers (see 
Calin and Croce, 2006). Genomic changes also affect transcriptional activity of the 
genes (Myllykangas et al., 2008). Publication 4 introduces a novel computational 
approach for screening cancer-associated DNA mutations with functional implica- 
tions by genome-wide integration of chromosomal aberrations and transcriptional 
activity. 

This chapter has provided an overview to central modeling challenges and re- 
search topics in functional genomics. In the following chapters, particular method- 
ological approaches are introduced to solve research tasks in large-scale analysis 
of the human transcriptome. In particular, methods are introduced to increase 
the reliability of high-throughput measurements, to model large-scale collections 
of transcriptome data and to integrate transcriptional profiling data to other lay- 
ers of genomic information. The next chapter provides general methodological 
background for these studies. 
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Statistical learning and 
exploratory data analysis 

Essentially, all models are wrong, but some are useful. 

G.E.P. Box and N.R. Draper (1987) 

Models are condensed, simplified representations of observed phenomena. Mod- 
els can be used to describe observations and to predict future events. Two key 
aspects in modeling are the construction and learning of formal representations 
of the observed data. Complex real-world observations contain large amounts of 
uncontrolled variation, which is often called noise; all aspects of the data cannot 
be described within a single model. Therefore, a modeling compromise is needed 
to decide what aspects of data to describe and what to ignore. The second step 
in modeling is to fill in, to learn, details of the formal representation based on the 
actual empirical observations. Various learning algorithms are typically available 
that differ in efficiency and accuracy. For instance, improvements in computation 
time can often be achieved by potential decrease in accuracy. An inference com- 
promise is needed to decide how to balance between these and other potentially 
conflicting objectives of the learning algorithm; the relative importance of each 
factor depends on the particular application and available resources, and affects 
the choice of the learning procedure. The modeling and inference compromises are 
at the heart of data analysis. Ultimately, the value of a model is determined by 
its ability to advance the solving of practical problems. 

This chapter gives an overview of the; key concepts in statistical modeling cen- 
tral to the topics of this thesis. The objectives of exploratory data analysis and 
statistical learning are considered in Section 3.1. The methodological framework is 
introduced in Section 3.2, which contains an overview of central concepts in proba- 
bilistic modeling and the Bayesian analysis paradigm. Key issues in implementing 
and validating the models are discussed in Section 3.3. 

3.1 Modeling tasks 

Understanding requires generalization beyond particular observations. While em- 
pirical observations contain information of the underlying process that generated 
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the data, a major challenge in computational modeling is that empirical data is al- 
ways finite and contains only limited information of the system. Traditional statis- 
tical models are based on careful hypothesis formulation and systematic collection 
of data to support or reject a given hypothesis. However, successful hypothesis 
formulation may require substantial prior knowledge. When minimal knowledge of 
the system is available, there is a need for exploratory methods that can recognize 
complex patterns and extract features from empirical data in an automated way 
(Baldi and Brunak, 1999). This is a central challenge in computational biology, 
where the investigated systems are extremely complex and contain large amounts 
of poorly characterized and uncontrolled sources of variation. Moreover, the data 
of genomic systems is often very limited and incomplete. General-purpose algo- 
rithms that can learn relevant fciaturcs from the data with minimal assumptions arc 
therefore needed, and they provide valuable tools in functional genomics studies. 
Classical examples of such exploratory methods include clustering, classification 
and visualization techniques. The extracted features can provide hypotheses for 
more detailed experimental testing and reveal new, unexpected findings. In this 
work, general-purpose exploratory tools are developed for central modeling tasks 
in functional genomics. 

3.1.1 Central concepts in data analysis 

Let us start by defining some of the basic concepts and terminology. Data set in 
this thesis refers to a finite collection of observations, or samples. In experimental 
studies, as in biology, a sample typically refers to the particular object of study, 
for instance a patient or a tissue sample. In computational studies, sample refers 
to a numerical observation, or a subset of observations, represented by a numerical 
feature vector. Each element of the feature vector describes a particular feature of 
the observation. Given D features and N samples, the data set is presented as a 
matrix X e R^^^ , where each column vector x e represents a sample and each 
row corresponds to a particular featiuc. The features can represent for instance 
different experimental conditions, time points, or particular summaries about the 
observations. This is the general structure of the observations investigated in this 
work. 

The observations are modeled in terms of probability densities; the samples are 
modeled as independent instances of a random variable. A central modeling task is 
to characterize the underlying probability density of the observations, p{x). This 
defines a topology in the sample space and provides the basis for generalization 
beyond empirical observations. As explained in more detail in Section 3.2, the 
models are formulated in terms of observations X, model parameters 6, and latent 
variables Z that are not directly observed, but characterize the underlying process 
that generated the data. 

Ultimately, all models describe relationships between objects. Similarity is 
therefore a key concept in data analysis; the basis for characterizing the relations, 
for summarizing the observations, and for predicting future events. Measures of 
similarity can be defined for different classes of objects such as feature vectors, 
data sets, or random variables. Similarity in general is a vague concept. Euclidean 
distance, induced by the Euclidean metrics, is a common (dis-)similarity mea- 
sure for multivariate observations. Correlation is a standard choice for univariate 
random variables. Mutual information is an information-theoretic measure of sta- 
tistical dependency between two random variables, characterizing the decrease in 
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the uncertainty concerning the realization of one variable, given the other one. The 
uncertainty of a random variable X is measured in terms of entropy^ (Shannon, 
1948). The mutual information between two random variables is then given by 
IiX,y) = H{X) - H{X\y) (see e.g. Gelman et al, 2003). The Kullback-Leibler 
divergence, or KL-divergence, is a closely related non-symmetric dissimilarity mea- 
sure for probability distributions p, q, defined as dxhip, o) = p(x) log ^^dx (see 
e.g. Bishop, 2006). Mutual information between two random variables can be al- 
ternatively formulated as the KL-divergence between their joint density p(x, y) 
and the product of their independent marginal densities, g(x, y) = Px{^)Py{y), 
which gives the connection I(X,y) = dKL{p{^,y),Px{^)Py{y))- Mutual informa- 
tion and KL-divergence are central information-theoretic measures of dependency 
employed in the models of this thesis. 

It is important to notice that measures of similarity are inherently coupled 
to the statistical representation of data and to the goals of the analysis; differ- 
ent representations can reveal different relationships between observations. For 
instance, the Euclidean distance is sensitive to scaling of the features; represen- 
tation in natural or logarithmic scale, or with different units can potentially lead 
to very different analysis results. Not all measures are equally sensitive; mutual 
information can naturally detect non-linear relationships, and it is invariant to 
the scale of the variables. On the other hand, estimating mutual information is 
computationally demanding. 

Feature selection refers to computational techniques for selecting, scaling and 
transforming the data into a suitable form for further analysis. Feature selection 
has a central role in data analysis, and it is implicitly present in all analysis tasks 
in selecting the investigated features for the analysis. 

There are no universally optimal stand-alone feature selection techniques, since 
the problem is inherently entangled with the analysis task and multiple equally op- 
timal feature sets may be available for instance in classification or prediction tasks 
Guyon and Elisseeff (2003); Saeys et al. (2007). Successful feature selection can re- 
duce the dimensionality of the data with minimal loss of relevant information, and 
focus the analysis on particular features. This can reduce model complexity, which 
is expected to yield more efRcient, generalizable and interpretable models. Feature 
selection is particularly important in genome-wide profiling studies, where the di- 
mensionality of the data is large compared to the number of available samples, and 
only a small number of features are relevant for the studied phenomenon. This is 
also known as the large p, small n problem (West, 2003). Advanced feature selec- 
tion techniques can take into account dependencies between the features, consider 
weighted combinations of them, and can be designed to interact with the more 
general modeling task, as for instance in the nearest shrunken centroids classifier 
of Tibshirani et al. (2002). The constrained subspace clustering model of Publi- 
cation 3 can be viewed as a feature selection procedure, where high-dimensional 
genomic observations are decomposed into distinct feature subsets, each of which 
reveals different relationships of the samples. In Publication 4, identification of 
maximally informative features between two data sets forms a central part of a 
regularized dependency modeling framework. In Publications 3-4 the procedure 
and representations are motivated by biological reasoning and analysis goals. 



^Entropy is defined as H{X) = — J'^p(x) logp(x)(ix for a continuous variable. 
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3.1.2 Exploratory data analysis 

Exploratory data analysis refers to the use of computational techniques to sum- 
marize and visuahze data in order to facilitate the generation of new hypotheses 
for further study when the search space would be otherwise exhaustively large 
(Tukey, 1977). The analysis strategy takes the observations as the starting point 
for discovering interesting regularities and novel research hypotheses for poorly 
characterized large-scale systems without prior knowledge. The analysis can then 
proceed from general observations of the data toward confirmatory data analysis, 
more detailed investigations and hypotheses that can be tested in independent 
data sets with standard statistical procedures. Exploratory data analysis differs 
from traditional hypothesis testing where the hypothesis is given. Light-weight ex- 
ploratory tools are particularly useful with large data sets when prior knowledge on 
the system is minimal. Standard (exploratory approaches in computational biology 
include for instance clustering, classification and visualization techniques (Evanko, 
2010; Polanski and Kimmel, 2007). 

Cluster analysis refers to a versatile family of methods that partition data into 
internally homogeneous groups of similar data points, and often at the same time 
minimize the similarity between distinct clusters. Clustering techniques enable 
class discovery from the data. This differs from classification where the target 
is to assign new observations into known classes. The partitions provided by 
clustering can be nested, partially overlapping or mutually exclusive, and many 
clustering methods generalize the partitioning to cover previously unseen data 
points (Jain and Dubes, 1988). Clustering can provide compressed representations 
of the data based on a shared parametric representation of the observations within 
each cluster, as for instance in K-means or Gaussian mixture modeling (see e.g. 
Bishop, 2006). Certain clustering approaches, such as the hierarchical clustering 
(see e.g. Hastie et al., 2009), apply recursive schemes that partition the data into 
internally homogeneous groups without providing a parametric representation of 
the clusters. Cluster structure can be also discovered by linear algebraic operations 
on the distance matrices, as for instance in spectral clustering. The different 
approaches often have close theoretical connections. Clustering in general is an ill- 
defined concept that refers to a set of related but mutually incompatible objectives 
(Ben-David and Ackerman, 2008; Kleinberg, 2002). Cluster analysis has been 
tremendously popular in computational biology, and a comprehensive review of the 
different applications arc beyond the scope of this thesis. It has been observed, for 
instance, that genes with related functions have often similar expression profiles 
and are clustered together, suggesting that clustering can be used to formulate 
hypotheses concerning the function of previously uncharacterized genes (DeRisi 
et al., 1997; Eisen et al., 1998), or to discover novel cancer subtypes with biomedical 
implications (S0rlie et al., 2001). 

Visualization techniques are another widely used exploratory approach in com- 
putational biology. Visualizations can provide compact and intuitive summaries 
of complex, high-dimensional observations on a lower-dimensional display, for in- 
stance by linear projection methods such as principal component analysis, or by 
explicitly optimizing a lower-dimensional representation as in the self-organizing 
map (Kohonen, 1982). Visualization can provide the first step in investigating 
large data sets (Evanko, 2010). 
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3.1.3 Statistical learning 

Statistical learning refers to computational models that can learn to recognize 
structure and patterns from empirical data in an automated way. Unsupervised 
and supervised models form two main categories of learning algorithms. 

Unsupervised learning approaches seek compact descriptions of the data with- 
out prior knowledge. In probabilistic modeling, unsupervised learning can be for- 
mulated as the task of finding a probability distribution that describes the observed 
data and generalizes to new observations. This is also called density estimation. 
The parameter values of the model can be used to provide compact representa- 
tions of the data. Examples of unsupervised analysis tasks include methods for 
clustering, visualization and dimensionality reduction. In cluster analysis, groups 
of similar observations are sought from the data. Dimensionality reduction tech- 
niques provide compact lower-dimensional representations of the original data, 
which is often useful for subsequent modeling steps. Not all observations of the 
data are equally valuable, and assessing the relevance of the observed regularities 
is problematic in fully unsupervised analysis. 

In supervised learning the task is to learn a function that maps the inputs x 
to the desired outputs y based on a set of training examples in a generalizable 
fashion, as in regression for continuous outputs, and classification for discrete 
output variables. The supervised learning tasks arc inherently asymmetric; the 
inference proceeds from inputs to outputs, and prior information of the modeling 
task is used to supervise the analysis; the training examples also include a desired 
output of the model. 

The models developed in this thesis can be viewed as unsupervised exploratory 
techniques. However, the distinction between supervised and unsupervised models 
is not strict, and the models in this thesis borrow ideas from both categories. The 
models in Publications 2-3 are unsupervised algorithms that utilize prior infor- 
mation derived from background databases to guide the modeling by constraining 
the solutions. However, since no desired outputs are available for these models, 
the modeling tasks differ from supervised analysis. The dependency modeling al- 
gorithms of Publications 4-6 have close theoretical connections to the supervised 
learning task. In contrast to supervised learning, the learning task in these algo- 
rithms is symmetric; modeling of the co-occurring data sets is unsupervised, but 
coupled. Each data set affects the modeling of the other data set in a symmet- 
ric fashion, and, in analogy to supervised learning, prediction can then proceed 
to either direction. Compared to supervised analysis tasks, the emphasis in the 
dependency detection algorithms introduced in this thesis is in the discovery and 
characterization of symmetric dependencies, rather than in the construction of 
asymmetric predictive models. 

3.2 Probabilistic modeling paradigm 

The main contributions of this thesis follow the generative probabilistic modeling 
paradigm. Generative probabilistic models describe the observed data in terms 
of probability distributions. This allows the calculation of expectations, variances 
and other standard summaries of the model parameters, and at the same time 
allows to describe the independence assumptions and relations between variables, 
and uncertainty in the modeling process in an explicit manner. Measurements 
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are regarded as noisy observations of the general, underlying processes; generative 
models are used to characterize the processes that generated the observations. 

The first task in modeling is the selection of a m,odel family - a set of potential 
formal representations of the data. As discussed in Section 3.2.2, the representa- 
tions can also to some extent be learned from the data. The second task is to define 
the objective function, or cost function, which is used to measure the descriptive 
power of the models. The third task is to identify the optimal model within the 
model family that best describes the observed data with respect to the objective 
function. This is called learning or model fitting. The details of the modeling pro- 
cess are largely determined by the exact modeling task and particular nature of 
the observations. The objectives of the modeling task are encoded in the selected 
model family, the objective function and to some extent to the model fitting proce- 
dure. The model family determines the space of possible descriptions for the data 
and has therefore a major influence on the flnal solution. The objective function 
can be used to prefer simple models or other aspects in the modeling process. The 
model fitting procedure affects the efficiency and accuracy of the learning process. 
For further information of these and related concepts, see Bishop (2006). A general 
overview of the probabilistic modeling framework is given in the remainder of this 
section. 

3.2.1 Generative modeling 

Generative probabilistic models view the observations as random samples from an 
underlying probability distribution. The model defines a probability distribution 
p{x) over the feature space. The model can be parameterized by model parame- 
ters 6 that specify a particular model within the model family. For convenience, 
we assume that the model family is given, and leave it out from the notation. 
In this thesis, the appropriate model families are selected based on biological hy- 
potheses and analysis goals. Generative models allow efficient representation of 
dependencies between variables, independence assumptions and uncertainty in the 
inference (KoUer and Friedman, 2009). Let us next consider central analysis tasks 
in generative modeling. 

Finite mixture models 

Classical probability distributions provide well-justified and convenient tools for 

probabilistic modeling, but in many practical situations the observed regularities 
in the data cannot be described with a single standard distribution. However, a 
sufficiently rich mixture of standard distributions can provide arbitrarily accurate 
approximations of the observed data. In mixture m,odels, a set of distinct, latent 
processes, or components., is used to describe the observations. The task is to 
identify and characterize the components and their associations to the individual 
observations. The standard formulation assumes independent and identically dis- 
tributed observations where each observation has been generated by exactly one 
component. In a standard mixture model the overall probability density of the 
data is modeled as a weighted sum of component distributions: 



R 




(3.1) 



r=l 
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where the components are indexed by r, and / p(x)(ix = 1. Each mixture compo- 
nent can have a diflFerent distributional form. The mixing proportion, or weight, 
and model parameters of each component are denoted by tt^ and dr, respectively, 
with X^^TTr = 1. Many applications utilize convenient standard distributions, 
such as Gaussians, or other distributions from the exponential family. Then the 
mixture model can be learned for instance with the EM algorithm described in 
Section 3.3.1. 

In practice, the mixing proportions of the components are often unknown. The 
mixing proportions can be estimated from the data by considering them as stan- 
dard model parameters to be fitted with a ML estimate. However, the procedure 
is potentially prone to overfitting and local optima, i.e., it may learn to describe 
the training data well, but fails to generalize to new observations. An alternative, 
probabilistic way to determine the weights is to treat the mixing proportions as 
latent variables with a prior distribution p(7r). A standard choice is a symmetric 
Dirichlct prior^ tt ^ Dir{^). This gives an equal prior weight for each com- 
ponent and guarantees the standard exchangeability assumption of the mixture 
component labels. A label determines cluster identity. Intuitively, exchangeability 
corresponds to the assumption that the analysis is invariant to the ordering of the 
data samples and mixture components. Compared to standard mixture models, 
probabilistic mixture models have increased computational complexity. 

Further prior knowledge can be incorporated in the model by defining prior 
distributions for the other parameters of the mixture model. This can also be used 
to regularize the learning process to avoid overfitting. A typical prior distribution 
for the components of a Gaussian mixture model, parameterized by Or = {f^rj ^r}) 
is the normal-inverse- Gamma prior (see e.g. Gelman et al., 2003). 

Interpreting the mixture components as clusters provides an alternative, prob- 
abilistic formulation of the clustering task. This has made probabilistic mixtme 
models a popular choice in the analysis of functional genomics data sets that 
typically have high dimensionality but small sample size. Probabilistic analysis 
takes the uncertainties into account in a rigorous manner, which is particularly 
useful when the sample size is small. The number of mixture components is of- 
ten unknown in practical modeling tasks, however, and has to be inferred based 
on the data. A straightforward solution can be obtained by employing a suffi- 
ciently large number of components in learning the mixture model, and selecting 
the components having non-zero weights as a post-processing step. An alternative, 
model-based treatment for learning the number of mixture components from the 
data is provided by infinite mixture models considered in Section 3.2.2. 

Latent variables and marginalization 

The observed variables are often affected by latent variables that describe relevant 
structure in the model, but are not directly observed. The latent variable values 
can be, to some extent, inferred based on the observed variables. Combination of 
latent and observed variables allows the description of complex probability spaces 
in terms of simple component distributions and their relations. Use of simple 
component distributions can provide an intuitive and computationally tractable 
characterization of complex generative processes underlying the observations. 

Dirichlet distribution is the probability density Dtr(7r|n) Y\r i""'^"^ where the multivariate 
random variable tt and the positive parameter vector n have their elements indexed by r, < 
TTr < 1, and X)r = 1- 
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A generative latent variable model specifies the distributional form and rela- 
tionships of the latent and observed variables. As a simple example, consider the 
probabilistic interpretation of probabilistic component analysis (PC A), where the 
observations x are modeled with a linear model x = Wz + e where a normally 
distributed latent variable z ~ A^(0,I) is transformed with the parameter matrix 
W and isotropic Gaussian noise (e) is assumed on the observations. More com- 
plex models can be constructed by analogous reasoning. A complete- data likelihood 
p(X, Z\0) defines a joint density for the observed and latent variables. Only a sub- 
set of variables in the modcil is typically of interest for the actual analysis task. For 
instance, the latent variables may be central for describing the generative process 
of the data, but their actual values may be irrelevant. Such variables are called 
nuisance variables. Their integration, or marginalization. provides probabilistic 
averaging over the potential realizations. Marginalization over the latent variables 
in the complete-data likelihood gives the likelihood 



Marginalization over the latent variables collapses the modeling task to finding 
optimal values for model parameters 0, in a way that takes into account the un- 
certainty in latent variable values. This can reduce the number of variables in the 
learning phase, yield more straightforward and robust inferences, as well as speed 
up computation. However, marginalization may lead to analytically intractable 
integrals. As certain latent variables may be directly relevant, marginalization de- 
pends on the overall goals of the analysis and may cover only a subset of the latent 
variables. In this thesis latent variables are utilized for instance in Publication 3, 
which treats the sample-cluster assignments as discrete latent variables, as well as 
in Publication 4, where a regularized latent variable model is introduced to model 
dependencies between co-occurring observations. 

3.2.2 Nonparametric models 

Finite mixture models and latent variable models require the specification of model 

structure prior to the analysis. This can be problematic since for instance the 
number and distributional shape of the generative processes is unknown in many 
practical tasks. However, the model structure can also to some extent be learned 
from the data. Non-parametric models provide principled approaches to learn the 
model structure from the data. In contrast to parametric models, the number and 
use of the parameters in nonparametric models is flexible (see e.g. Hjort et al., 
2010; Miiller and Quintana, 2004). The infinite mixture of Gaussians, used as a 
part of the modeling process in Publication 3, is an example of a non- parametric 
model where both the number of components, as well as mixture proportions of 
the component distributions are inferred from the data. Learning of Bayesian 
network structure is another example of nonparametric inference, where relations 
between the model variables are learned from the data (see e.g. Friedman, 2003). 
While more complex models can describe the training data more accurately, an 
increasing model complexity needs to be penalized to avoid overfitting and to 
ensure generalizability of the model. 

Nonparametric models provide flexible and theoretically principled approaches 
for data-driven exploratory analysis. However, the flexibility often comes with 
an increased computational cost, and the models are potentially more prone to 




(3.2) 
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overfitting than less flexible parametric models. Moreover, complex models can be 
difficult to interpret. 

Many nonparametric probabilistic models are defined by using the theory of 
stochastic processes to impose priors over potential model structures. Stochas- 
tic processes can be used to define priors over function spaces. For instance, the 
Dirichlet process (DP) defines a probability density over the function space of 
Dirichlet distributions^. The Chinese Restaurant Process (CRP) provides an in- 
tuitive description of the Dirichlet process in the cluster analysis context. The 
CRP defines a prior distribution over the number of clusters and their size distri- 
bution. The CRP is a random process in which n customers arrive in a restaurant, 
which has an infinite number of tables. The process goes as follows: The first 
customer chooses the first table. Each subsequent customer m will select a ta- 
ble based on the state Fm-i of the restaurant tables after m — 1 customers have 
arrived. The new customer m will select a previously occupied table i with a 
probability which is proportional to the number of customers seated at table ?', i.e. 
p{i\Fm-i) oc rij. Alternatively, the new customer will select an empty table with a 
probability which is proportional to a constant a. The model prefers tables with 
a larger number of customers, and is analogous to clustering, where the customers 
and tables correspond to samples and clusters, respectively. This provides an in- 
tuitive prior distribution for clustering tasks. The prior prefers compact models 
with relatively few clusters, but the number of clusters is potentially infinite, and 
ultimately determined based on the data. 

Infinite mixture models 

Infinite mixture models are a general class of nonparametric methods where the 

number of mixture components are determined in a data-driven manner; the num- 
ber of components is potentially infinite (see e.g. Miiller and Quintana, 2004; Ras- 
mussen, 2000). An infinite mixture is obtained by letting i? — >^ oo in the finite 
mixture model of Equation 3.1 and replacing the Dirichlet distribution prior of 
the mixing proportions tt by a Dirichlet process. The formal probability distri- 
bution of the Dirichlet process can be intuitively derived with the so-called stick- 
breaking presentation. Consider a unit length stick and a stick-breaking process, 
where the breakpoint /3 is stochastically determined, following the beta distribu- 
tion /? ^ Beta{l,a), where a tunes the expected breaking point. The process 
can be viewed as consecutively breaking off portions of a unit length stick to 
obtain an infinite sequence of stick lengths tti — /3i: iii — ftri/=i(l ^ A)i with 
Si^i ""i = 1 (Ishwaran and James, 2001). This defines the probability distribution 
Stick(Q!) over potential partitionings of the unit stick. A truncated stick-breaking 
representation considers only the first T elements. Setting the prior tt ^ Stick(a), 
defined by the stick- breaking representation in Equation 3.1 assigns a prior on the 
number of mixture components and their mixing proportions that are ultimately 
learned from the observed data. The prior helps to find a compromise between 
increasing model complexity and likelihood of the observations. 

Traditional approaches used to determine the number mixture components are 
based on objective functions that penalize increasing model complexity, for in- 
stance in certain variants of the K-means or in spectral clustering (see e.g. Hastie 

^If G is a distribution drawn from a Dirichlet process with the probability measure P over the 
sample space, G ~ DP(P), then each finite partition {^4^}^. of the sample space is distributed as 
(G(Ai), G(Afc)) ~ Dir{P{Ai), P{A^)). 
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et al., 2009). Other model selection criteria include cross-validation and com- 
parison of the models in terms of their likelihood or various information-theoretic 
criteria that seek a compromise between model complexity and fit (see e.g. Gelman 
et al., 2003). However, the sample size may be insufficient for such approaches, 
and the models may lack a rigorous framework to account for uncertainties in 
the observations and model parameters. Modeling uncertainty in the parameters 
while learning the model structure can lead to more robust inference in nonpara- 
metric probabilistic models but also adds inherent computational complexity in 
the learning process. 



3.2.3 Bayesian analysis 

The term 'Bayesian' refers to interpretation of model parameters as variables. The 
uncertainty over the parameter values, arising from limited empirical evidence, is 
described in terms of probability distributions. This is in contrast to the traditional 
view where parameters have fixed values with no distribution and the uncertainty 
is ignored. The Bayesian approach leads to a learning task where the objective is 
to estimate the posterior distribution p{6\'X.) of the model parameters 0, given the 
observations X. The posterior is given by the Bayes' rule (Bayes, 1763): 

,(«|X, = «e. ,3.3, 

The two key elements of the posterior are the likelihood and the prior. The like- 
lihood p(X.\6) describes the probability of the observations, given the parameter 
values 6. The parameters can also characterize alternative model structures. The 
prior p{0) encodes prior beliefs about the model and rewards solutions that match 
with the prior assumptions or yield simpler models. Such regularizing properties 
can be particularly useful when training data is scarce and there is considerable 
uncertainty in the parameter estimates. With strong, informative priors, new ob- 
servations have little effect on the posterior. In the limit of large sample size the 
posterior converges to the ordinary likelihood p{X\d). The Bayesian inference pro- 
vides a robust framework for taking the uncertainties into account when the data 
is scarce, as it often is in practical modeling tasks. Moreover, the Bayes' rule pro- 
vides a formal framework for sequential update of beliefs based on accumulating 
evidence. The prior predictive density p(X) = J p(X, 0)d6 is a normalizing con- 
stant, which is independent of the parameters 6 and can often be ignored during 
model fitting. 

The involved distributions can have complex non-standard forms and limited 

empirical data can only provide partial evidence regarding the different aspects 
of the data-generating process. Often only a subset of the parameters and other 
variables and their interdependencies can be directly observed. The Bayesian ap- 
proach provides a framework for making inferences on the unobserved quantities 
through hierarchical models, where the probability distribution of each variable 
is characterized by higher- level parameters, so-called hyperparameters. A similar 
reasoning can be used to model the uncertainty in the hyperparameters, until the 
uncertainties become modeled at an appropriate detail. Prior information can help 
to compensate the lack of data on certain aspects of a model, and explicit models 
for the noise can characterize uncertainty in the empirical observations. Distribu- 
tions can also share parameters, which provides a basis for pooling evidence from 
multiple sources, as for instance in Publication 4. In many applications only a 
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subset of the parameters in the model are of interest and the modchng process 
can be considerably simplified by marginalizing over the less interesting nuisance 
variables to obtain an expectation over their potential values. 

The Bayesian paradigm provides a principled framework for modeling the un- 
certainty at all levels of statistical inference, including the parameters, the observed 
and latent variables and the model structure; all information of the model is in- 
corporated in the posterior distribution, which summarizes empirical evidence and 
prior knowledge, and provides a complete description of the expected outcomes of 
the data-generating process. When the data docs not contain sufficient informa- 
tion to decide between the alternative model structures and parameter values, the 
Bayesian framework provides tools to take expectations over all potential models, 
weighted by their relative evidence. 

A central challenge in the Bayesian analysis is that the models often include 
analytically intractable posterior distributions, and learning of the models can be 
computationally demanding. Widely-used approaches for estimating posterior dis- 
tributions include Markov Chain Monte Carlo (MCMC) methods and variational 
learning. Stochastic MCMC methods provide a widely-used family of algorithms 
to estimate intractable distributions by drawing random samples from these distri- 
butions (see e.g. Gelman et al., 2003); a sufficiently large pool of random samples 
will converge to the underlying distribution, and sample statistics can then be used 
to characterize the distribution. However, sampling-based methods are computa- 
tionally intensive and slow. In variational learning, considered in Section 3.3.1, 
the intractable distributions are approximated by more convenient tractable dis- 
tributions, which yields faster learning procedure, but potentially less accurate 
results. While analysis of the full posterior distribution will provide a complete 
description of the uncertainties regarding the parameters, simplified summary sta- 
tistics, such as the mean, variance and quantiles of the posterior can provide a 
sufficient characterization of the posterior in many practical applications. They 
can be obtained for instance by summarizing the output of sampling-based or vari- 
ational methods. Moreover, when the uncertainty in the results can be ignored, 
point estimates can provide simple, interpretable summaries that are often useful 
in further biomedical analysis, as for instance in Publication 2. Point estimates are 
single optimal values with no distribution. However, point estimates are not nec- 
essarily sufficient for instance in biomedical diagnostics and other prediction tasks, 
where different outcomes are associated with different costs and it may be crucial 
to assess the probabilities of the alternative outcomes. For further discussion on 
learning the Bayesian models, see Section 3.3.1. 

In this thesis the Bayesian approach provides a formal framework to perform 
robust inference based on incomplete functional genomics data sets and to incor- 
porate prior information of the models in the analysis. The Bayesian paradigm 
can alternatively be interpreted as a philosophical position, where probability is 
viewed as a subjective concept (Cox, 1946), or considered a direct consequence 
of making rational decisions under uncertainty (Bernardo and Smith, 2000). For 
further concepts in model selection, comparison and averaging in the Bayesian 
analysis, see Gelman et al. (2003). For applications in computational biology, see 
Wilkinson (2007). 
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3.3 Learning and inference 

The final stage in probabilistic modeling is to learn the optimal statistical presen- 
tation for the data, given the model family and the objective function. This section 
highlights central challenges and methodological issues in statistical learning. 

3.3.1 Model fitting 

Learning in probabilistic models often focuses on optimizing the model parame- 
ters 6. In addition, posterior distribution of the latent variables, p(z|x, 0), can be 
calculated. Estimating the latent variable values is called statistical inference. In 
the Bayesian analysis, the model parameters can also be treated as latent vari- 
ables with a prior probability density, in which case the distinction between model 
parameters and latent variables will disappear. A comprehensive characterization 
of the variables and their uncertainty would be achieved by estimating the ftill 
posterior distribution. However, this can be computationally very demanding, in 
particular when the posterior is not analytically tractable. The posterior is often 
approximated with stochastic or analytical procedures, such as stochastic MCMC 
sampling methods or variational approximations, and appropriate summary sta- 
tistics. In many practical settings, it is sufficient to summarize the full posterior 
distribution with a point estimate. Point estimates do not characterize the uncer- 
tainties in the analysis result, but are often more convenient to interpret than ftill 
posterior distributions. 

Various optimization algorithms are available to learn statistical models, given 
the learning procedure. The potential challenges in the optimization include com- 
putational complexity and the presence of local optima on complex probability 
density topologies, as well as unidentifiability of the models. Finding a global op- 
timum of a complex model can be computationally exhaiistive, and it can become 
intractable with increasing sample size. In unidentifiable models, the data does 
not contain sufficient information to choose between alternative models with equal 
statistical evidence. Ultimately, the uncertainty in inference arises from limited 
sample size and the lack of computational resources. 

In the remainder of this section, let us consider more closely the particular 
learning procedures central to this thesis: point estimates and variational approx- 
imation, and the standard optimization algorithms used to learn such representa- 
tions. 

Point estimates 

Assuming independent and identically distributed observations, the likelihood of 
the data, given model parameters, is p(X|0) = JJjp(xj|0). This provides a prob- 
abilistic measure of model fit and the objective function to maximize. Maximiza- 
tion of the likelihood p(K\9) with respect to 9 yields a maximum likelihood (ML) 
estimate of the model parameters, and specifies an optimal model that best de- 
scribes the data. This is a standard point estimate used in probabilistic modeling. 
Practical implementations typically operate on log-likelihood, the logarithm of the 
likelihood function. As a monotone function, this yields the same optima, but has 
additional desirable properties: it factorizes the product into a sum and is less 
prone to numerical overflows during optimization. 



25 



CHAPTER 3. STATISTICAL LEARNING AND EXPLORATORY DATA ANALYSIS 



The maximum a posteriori (MAP) estimate additionally takes prior informa- 
tion of the model parameters into account. While the ML estimate maximizes 
the likelihood |5(X|0) of the observations, the MAP estimate maximizes the pos- 
terior p{d\X.) ^ p{li.\0)p{6) of the model parameters. The objective function to 
maximize is the log-likelihood 

logp{e\X.) - logpCKie) + logp{e). (3.4) 

The prior is explicit in MAP estimation and the model contains the ML esti- 
mate as a special case; assuming large sample size, or non-informative, uniform 
prior p{6) ^ 1, the likelihood of the data p(X|0) will dominate and the MAP esti- 
mation becomes equivalent to optimizing p(X|0), yielding the traditional ML esti- 
mate. The ML and MAP estimates are asymptotically consistent approximations 
of the posterior distribution, since the posterior will converge a point distribution 
with a large sample size. The computation and interpretation of point estimates is 
straightforward compared to the use of posterior distributions in the full Bayesian 
treatment. The differences between ML and MAP estimates highlight the role of 
prior information in the modeling when training data is limited. 

Variational inference 

In certain modeling tasks the uncertainty in the model parameters needs to be 
taken into account. Then point estimates are not sufficient. The uncertainty is 
characterized by the posterior distribution p(0|X). However, the posterior distri- 
butions arc; oftc^n intractabk; and need to be estimated by approximative methods. 
Variational approximations provide a fast and principled optimization scheme (see 
e.g. Bishop, 2006) that yields only approximative solutions, but can accelerate 
posterior inference by orders of magnitude compared to stochastic, sampling-based 
MCMC methods that can in principle provide exact solutions, assuming that in- 
finite computational resources are available. The potential decrease in accuracy 
in variational approximations is often acceptable, given the gains in efficiency. 
Variational approximation characterizes the uncertainty in 6 with a tractable dis- 
tribution q{0) that approximates the full, potentially intractable posterior p(0|X), 
Variational inference is formulated as an optimization problem where an in- 
tractable posterior distribution p(Z, 0|X) is approximated by a more easily tract- 
able distribution g(Z, 0) by minimizing the KL-divergence between the two distri- 
butions. This is also shown to maximize a lower bound of the marginal likelihood 
p(X), and subsequently the likelihood of the data, yielding an approximation of 
the overall model. The log-likelihood of the data can be decomposed into a sum 
of the lower bound C{q) of the observed data and the KL divergence dKL{<l,p) 
between the approximative and the exact posterior distributions: 

logp{K)=/:{q)+dKL{q,p), (3.5) 

where 

C(q) = J^qiZ,9)logS^; 

dKLiq.p) = -J^q{Z,9)log2i^. ^"^'^^ 

The KL-divergence is non-negative, and equals to zero if and only if the ap- 
proximation and the exact distribution are identical. Therefore C{q) gives a 



26 



3.3. LEARNING AND INFERENCE 



lower bound for the log-likelihood logp(X.) in Equation 3.5. Minimization of (Ikl 
with respect to q will provide an analytically tractable approximation g(Z, 6) of 
p{Z, 0|X). Minimization of cIkl will also maximize the lower bound C{q) since the 
log-likelihood logp{X.) is independent of q. The approximation typically assumes 
independent parameters and latent variables, yielding a factorized approximation 
q{Z,d) = <7z(Z)ge(^) based on tractable standard distributions. It is also possi- 
ble to factorize q^ and qe into further components. Variational approximations 
are used for efficient learning of infinite multivariate Gaussian mixture models in 
Publication 3. 

Expectation Maximization (EM) 

The EM algorithm is a general procedure for learning probabilistic latent variable 
models (Dempster et al., 1977), and a special case of variational inference. The 
algorithm provides an efficient algorithm for finding point estimates for model 
parameters in latent variable models. The objective of the EM algorithm is to 
maximize the marginal likelihood 



of the observations X with respect to the model parameters 0. Marginalization 
over the probability density of the latent variables provides an inference procedure 
that is robust to uncertainty in the latent variable values. The algorithm iterates 
between estimating the posterior of the latent variables, and optimizing the model 
parameters (see e.g. Bishop, 2006). Given initial values do of the model param- 
eters, the expectation step evaluates the posterior density of the latent variables, 
p{z\x,6t), keeping 6t fixed. If the posterior is not analytically tractable, varia- 
tional approximation g(z) can be used to obtain a lower bound for the likelihood 
in Equation 3.7. The maximization step optimizes the model parameters 6 with 
respect to the following objective function: 



This is the expectation of the complete-data log-likelihood, Zo9p(X, Z|0) over the 
latent variable density p(Z|X, 0t), obtained from the previous expectation step. 
The new parameter estimate is then 



The expectation and maximization steps determine an iterative learning pro- 
cedure where the latent variable density and model parameters are iteratively 
updated until convergence. The maximization step will also increase the target 
likelihood of Equation 3.7, but potentially with a remarkably smaller computa- 
tional cost (Dempster et al., 1977). In contrast to the marginal likelihood in 
Equation 3.7, the complete-data likelihood in Equation 3.8 is logarithmized be- 
fore integration in the maximization step. When the joint distribution p(x, z|0) 
belongs to the exponential family, the logarithm will cancel the exponential in 
algebraic manipulations. This can considerably simplify the maximization step. 
When the likelihoods in the optimization are of suitable form, the iteration steps 
can be solved analytically, which can considerably reduce required evaluations of 




(3.7) 




(3.8) 



Ot+i = argmaxgQ{9,6t). 
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the objective function. Convergence is guaranteed, if the optimization can increase 
the Ukehhood at each iteration. However, the identification of a global optimum 
is not guaranteed in the EM algorithm. 

Incorporating prior information of the parameter values through Bayesian pri- 
ors can be used to avoid overfitting and focus the modeling on particular features 
in the data, as in the regularized dependency modeling framework of Publication 4, 
where the EM algorithm is used to learn Gaussian latent variable models. 

Standard optimization methods 

Optimization methods provide standard tools to implement selected learning pro- 
cedures. Optimization algorithms arc used to identify parameter values that min- 
imize or maximize the objective function, either globally, or in local surroundings 
of the optimized value. Selection of optimization method depends on smooth- 
ness and continuity properties of the objective function, required accuracy, and 
available resources. 

Gradient-based approaches optimize the objective function by assuming smooth, 
continuous topology over the probability density where setting the derivatives to 
zero will yield local optima. If a closed form solution is not available, it is of- 
ten possible to estimate gradient directions in a given point. Optimization can 
then proceed by updating the parameters towards the desired direction along the 
gradient, gradually improving the objective function value in subsequent gradient 
ascent steps. So-called quasi- Newton methods use function values and gradients 
to characterize the optimized manifold, and to optimize the parameters along the 
approximated gradients. An appropriate step length is identified automatically 
based on the curvature of the objection function surface. The Broyden-Fletcher- 
Goldfarb-Shanno (BFGS) (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 
1970) method is a quasi-Newton approach used for standard optimization tasks in 
this thesis. 

3.3.2 Generalizability and overlearning 

Probabilistic models are formulated in terms of probability distributions over the 
sample space and parameter values. This forms the basis for generalization to 
new, unobserved events. A generalizable model can describe essential character- 
istics of the underlying process that generated the observations; a generalizable 
model is also able to characterize future observations. Overlearning, or overfitting 
refers to models that describe the training data well, but do not generalize to new 
observations. Such models describe not only the general processes underlying the 
observations, but also noise in the particular observations. Avoiding overfitting 
is a central aspect in modeling. Overlearning is particularly likely when training 
data is scarce. While overfitting could in principle be avoided by collecting more 
data, this is often not feasible since the cost of data collection can be prohibitively 
large. 

Generalizability can be measured by investigating how accurately the model 
describes new observations. A standard approach is to split the data into a training 
set, used to learn the model, and a test set, used to measure model performance on 
unseen observations that were not used for training. In cross-validation the test 
is repeated with several different learning and test sets to assess the variability 
in the testing procedure. Cross-validation is used for instance in Publication 5 of 
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this thesis. Bootstrap analysis (see, for instance, Efron and Tibshirani, 1994) is 
another widely used approach to measure model performance. The observed data 
is viewed as a finite realization of an underlying probability density. New samples 
from the underlying density are obtained by re-sampling the observed data points 
with replacement to simulate variability in the original data; observations from 
the more dense regions of the probability space become re-sampled more often 
than rare events. Each bootstrap sample resembles the probability density of the 
original data. Modeling multiple data sets obtained with the bootstrap helps to 
estimate the sensitivity of the model to variations in the data. Bootstrap is used 
to assess model performance in Publication 6. 

3.3.3 Reguleirization and model selection 

In general, increasing model complexity will yield more fiexible models, which have 
higher descriptive power but are, on the other hand, more likely to ovcrfit. There- 
fore relatively simple models can often outperform more complex models in terms 
of generalizability. A compromise between simplicity and descriptive power can be 
obtained by imposing additional constraints or soft penalties in the modeling to 
prefer compact solutions, but at the same time retain the descriptive power of the 
original, flexible model family. This is called regularization. Regularization is par- 
ticiilarly important when the sample size is small, as demonstrated for instance in 
Publication 4, where explicit and theoretically principled regularization is achieved 
by setting appropriate priors on the model structure and parameter values. The 
priors will then affect the MAP estimate of the model parameters. One commonly 
used approach is to prefer sparse solutions that allow only a small number of the 
potential parameters to be employed at the same time to model the data (see e.g. 
Archambeau and Bach, 2008). A family of probabilistic approaches to balance 
between model fit and model complexity is provided by information-theoretic cri- 
teria (see e.g. Gelman et al., 2003). The Bayesian Information Criterion (BIC) is 
a widely used information criterion that introduces a penalty term on the number 
of model parameters to prefer simpler models. The log-likelihood £ of the data, 
given the model, is balanced by a measure of model complexity, qlog{N), in the 
final objective function —2C + qlog{N). Here q denotes the number of model pa- 
rameters and N is the constant sample size of the investigated data set. The BIC 
has been criticized since it does not address changes in prior distributions, and 
its derivation is based on asymptotic considerations that hold only approximately 
with finite sample size (see e.g. Bishop, 2006). On the other hand, BIC provides a 
principled regularization procedure that is easy to implement. In this thesis, the 
BIC has been used to regularize the algorithms in Publication 3. 

3.3.4 Validation 

After learning a probabilistic model, it is necessary to confirm the quality of the 
model and verify potential findings in further, independent experiments. Valida- 
tion refers to a versatile set of approaches used to investigate model performance, 
as well as in model criticism, comparison and selection. Internal and external 
approaches provide two complementary categories for model validation. Inter- 
nal validation refers to procedures to assess model performance based on training 
data alone. For instance, it is possible to estimate the sensitivity of the model 
to initialization, parameterization, and variations in the data, or convergence of 
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the learning process. Internal analysis can help to estimate the weaknesses and 
generalizability of the model, and to compare alternative models. Bootstrap and 
cross-validation are widely used approaches for internal validation and the analysis 
of model performance (see e.g. Bishop, 2006). Bootstrap can provide information 
about the sensitivity of the results to sampling effects in the data. Cross-validation 
provides information about the model generalization performance and robustness 
by comparing predictions of the model to real outcomes. External validation ap- 
proaches investigate model predictions and fit on new, independent data sets and 
experiments. Exploratory analysis of high-throughput data sets often includes 
massive multiple testing, and provides potentially thousands of automatically gen- 
erated hypotheses. Only a small set of the initial findings can be investigated more 
closely by human intervention and costly laboratory experiments. This highlights 
the need to prioritize the results and assess the uncertainty in the models. 
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Reducing uncertainty in 
high-throughput microarray 
studies 

As far as the laws of mathematics refer to reality, they are not certain, 
as far as they are certain, they do not refer to reality. 

A. Einstein (1956) 

Gene expression microarrays are currently the most widely used technology for 
genome- wide transcriptional profiling, and they constitute the main source of data 
in this thesis. An overview of microarray technology is provided in Section 2.3.1. 
Microarray measurements are associated with high levels of noise from technical 
and biological sources. Appropriate preprocessing techniques can help to reduce 
noise and obtain reliable measurements, which is the crucial starting point for 
any data analysis task. This chapter presents the first main contribution of the 
thesis, preprocessing techniques that utilize side information in genomic sequence 
databases and microarray data collections in order to improve the accuracy of high- 
throughput gene expression data. The chapter is organized as follows: Section 4.1 
gives an overview of the various sources of noise in high-throughput microarray 
studies. Section 4.2 introduces a strategy for noise reduction based on side infor- 
mation in external genomic sequence databases. Section 4.3 extends this model by 
describing a model-based approach that additionally combines statistical evidence 
across multiple microarray experiments in order to provide quantitative informa- 
tion of probe performance and utilizes this information to improve the reliability 
of high-throughput observations. The results are summarized in Section 4.4. 

4.1 Sources of uncertainty 

Measurement data obtained with novel high-throughput technologies comes with 
high levels of uncontrolled biological and technical variation. This is often called 
noise as it obscures the measurements, and adds potential bias and variance on 
the signal of interest. Biological noise is associated with natural biological varia- 
tion between cell populations, cellular processes and individuals. Single-nucleotide 
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polymorphisms, alternative splicing and non-specific hybridization add biological 
variation in the data (Dai et al., 2005; Zhang et al., 2005). More technical sources 
of noise in the measurement process include RNA extraction and amplification, 
experiment-specific variation, as well as platform- and laboratory-specific effects 
(Choi et al., 2003; MAQC Consortium, 2006; Tu et al., 2002). 

A significant source of noise on gene expression arrays comes from individ- 
ual probes that are designed to measure the activity of a given transcript in a 
biological sample. Figure ??A shows probe-level observations of differential gene 
expression for a collection of probes designed to target the same mRNA transcript. 
One of the probes is highly contaminated and likely to add unrelated variation to 
the analysis. A number of factors affect probe performance. For instance, it has 
been reported in Publication 1 and elsewhere (Hwang ct al., 2004; Mccham ct al., 
2004b) that a large portion of microarray probes may target unintended mRNA 
sequences. Moreover, although the probes have been designed to uniquely hy- 
bridize with their intended mRNA target, remarkable cross-hybridization with the 
probes by single- nucleotide polymorphisms (Dai et al., 2005; Sliwerska et al., 2007) 
and other mRNAs with closely similar sequences (Zhang et al., 2005) have been 
reported; liigh-afFinity probes with high GC-content may have higher likelihood of 
cross- hybridization with nonspecific targets (Mei et al., 2003). Alternative splic- 
ing (MAQC Consortium, 2006) and mRNA degradation (Auer et al., 2003) may 
cause differences between probes targeting different positions of the gene sequence. 
Such effects will contribute to probe-level contamination in a probe- and condition- 
specific manner. However, sources of probe-level noise are still poorly understood 
(Irizarry et al., 2005; Li et al., 2005) despite their importance for expression anal- 
ysis and probe design. 

High levels of noise set specific challenges for analysis. Better understanding of 
the technical aspects of the measurement process will lead to improved analytical 
procedures and ultimately to more accurate biological results (Reimers, 2010). 
Publication 2 provides computational tools to investigate probe performance and 
the relative contributions of the various sources of probe-level contamination on 
short oligonucleotide arrays. 

4.2 Preprocessing microarray data with side in- 
formation 

Preprocessing of the raw data obtained from the original measurements can help 
to reduce noise and improve comparability between microarray experiments. Pre- 
processing can be defined in terms of statistical transformations on the raw data, 
and this is a central part of data analysis in high-throughput studies. This sec- 
tion outlines the standard preprocessing steps for short oligonucleotide arrays, the 
main source of transcriptional profiling data in this thesis. However, the general 
concepts also apply to other microarray platforms (Reimers, 2010). 

Standard preprocessing steps 

A number of preprocessing techniques for short oligonucleotide arrays have been 

introduced (Irizarry et al., 2006; Reimers, 2010). The standard preprocessing steps 
in microarray analysis include quality control, background correction, normaliza- 
tion and summarization. 
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Figure 4.1: A Example of a probe set that contains a probe with high contamination levels 
(dashed line) detected by the probabilistic RPA model. The probe-level observations of differen- 
tial gene expression for the different probes that measure the same target transcript are indicated 
by gray lines. The black line shows the estimated signal of the target transcript across a number 
of conditions. B Increase in the average variance of the probes associated with the investigated 
noise sources: mistargeted probes having errors in the genomic alignment, most 5'/3' probes 
of each probe set, GC-rich, and SNP-associated probes. The variances were estimated by RPA 
and describe the noise level of the probes. The results axe shown for the individual ALL and 
GEA data sets, and for their combined results on both platforms (133A and 95A/Av2). ©IEEE. 
Reprinted with permission from Publication 2. 

Microarray quality control is used to identify arrays with remarkable experi- 
mental defects, and to remove them from subsequent analysis. The typical tests 
consider RNA degradation levels and a number of other summary statistics to 
guarantee that the array data is of reasonable quality. The arrays that pass the 
microarray quality control are preprocessed further. Each array typically has spa- 
tial biases that vary smoothly across the array, arising from technical factors in 
the experiment. Background correction is used to detect and remove such spatial 
effects from the array data, and to provide a uniform background signal, enhanc- 
ing the comparability of the probe-level observations between different parts of 
the array. Moreover, background correction can estimate the general noise level 
on the array; this helps to detect probes whose signal differs significantly from the 
background noise. Robust multi-array averaging (RMA) is one of the most widely 
used approaches for preprocessing short oligonucleotide array data (Irizarry et al., 
2003a). The background correction in RMA is based on a global model for probe 
intensities. The observed intensity, Y, is modeled as a sum of an exponential 
signal component, S and Gaussian noise B. Background corrected data is then 
obtained as the expectation EB(5|y). While background correction makes the 
observations comparable within array, normalization is used to improve the com- 
parability between arrays. Quantile normalization is a widely used method that 
forces all arrays to follow the same empirical intensity distribution (see e.g. Bol- 
stad et al., 2003). Quantile normalization makes the measurements across different 
arrays comparable, assuming that the overall distribution of mRNA concentration 
is approximately the same in all cell populations. This has proven to be a feasible 
assumption in transcriptional profiling studies. As always, there are exceptions. 
For instance, human brain tissues have systematic differences in gene expression 
compared to other organs. On short oligonucleotide arrays, a number of probes 
target the same transcript. In the final summarization step, the individual probe- 
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level observations of each target transcript are summarized into a single summary 
estimate of transcript activity. Standard algorithmic implementations are available 
for each preprocessing step. 

Probe-level preprocessing methods 

Differences in probe characteristics cause systematic differences in probe perfor- 
mance. The use of several probes for each target leads to more robust estimates 
on transcript activity but it is clear that probe quality may significantly affect 
the results of a microarray study (Irizarry et al., 2003b). Widely used prepro- 
cessing algorithms utilize probe-specific parameters to model probe-specific effects 
in the probe summarization step. Some of the first and most well-known probe- 
level preprocessing algorithms include dChip/MBEI (Li and Wong, 2001), RMA 
(Irizarry ct al., 2003a), and gMOS (Milo ct al., 2003). Taking probe- level effects 
into account can considerably improve the quality of a microarray study (Reimers, 
2010). Publications 1 and 2 incorporate side information of the probes to prepro- 
cessing, and introduce improved probe-level analysis methods for differential gene 
expression studies. 

In order to introduce probe-level preprocessing methods in more detail, let 

us consider the probe summarization step of the RMA algorithm (Irizarry et al., 
2003a). RMA has a Gaussian model for probe effects with probe-specific mean 
parameters and a shared variance parameter for the probes. The mean parameters 
characterize probe-specific binding affinities that cause systematic differences in 
the signal levels captured by each probe. Estimating the probe-specific effects 
helps to remove this effect in the final probeset-level summary of the probe-level 
observations. To briefly outline the algorithm, let us consider a collection of probes 
(a probeset) that measure the expression level of the same target transcript g 
in condition i. The probe-level observations are modeled as a sum of the true, 
underlying expression signal g;, which is common to all probes, probe-speciflc 
binding affinity Hj, and Gaussian noise e. A probe- level observation for probe j in 
condition i is then modeled in RMA as 

Sij=gi + Ho+e- (4.1) 

Measurements from multiple conditions are needed to estimate the probe- 
specific effects ^j. RMA and other models that measure absolute gene expression 
have an important drawback: the probe affinity effects {nj} are unidentifiable. In 
order to obtain an identifiable model, the RMA algorithm includes an additional 
constraint that the probe affinity effects are zero on average: Sj/i^ = 0. This yields 
a well-defined algorithm that has been shown to produce accurate measurements 
of gene expression in practical settings. Further extensions of the RMA algorithm 
include gcRMA, which has a more detailed chemical model for the probe effects 
(Wu and Irizarry, 2004), refRMA (Katz et al., 2006), which utilizes probe-specific 
effects derived from background data collections, and fRMA (McCall et al., 2010), 
which also models batch-specific effects in microarray studies. The estimation of 
imidentifiable probe affinities is a main challenge for most probe-level preprocess- 
ing models. 

RMA and other probe-level models for short oligonucleotide arrays have been 
designed to estimate absolute expression levels of the genes. However, gene expres- 
sion studies are often ultimately targeted at investigating differential expression 
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levels, that is, differences in gene expression between experimental conditions. 

Measurements of differential expression is obtained for instance by comparing the 
expression levels, obtained through the RMA algorithm or other methods, between 
different conditions. However, the summarization of the probe-level values is then 
performed prior to the actual comparison. Due to the unidentifiability of the probe 
affinity parameters in the RMA and other probe-level models, this is potentially 
suboptimal. Publication 1 demonstrates that reversing the order, i.e., calculat- 
ing differential gene expression already at the probe level before probeset-level 
summarization, leads to improved estimates of differential gene expression. The 
explanation is that the procedure circumvents the need to estimate the unidentifi- 
able probe affinity parameters. This is formally described in Publication 2, which 
provides a probabilistic extension of the Probe-level Expression Change Averag- 
ing (PEC A) procedure of Publication 1. In PEC A, a standard weighted average 
statistics summarizes the probe level observations of differential gene expression. 
PECA does not model probe-specific effects, but it is shown to outperform widely 
used probe-level preprocessing methods, such as the RMA, in estimating differen- 
tial expression. Publication 2, considered in more detail in Section 4.3, provides 
an extended probabilistic framework that also models probe-specific effects. 

Utilizing side information in transcriptome databcises 

Probe-level preprocessing models and microarray analysis can be further improved 
by utilizing external information of the probes (Eisenstein, 2006; Hwang et al., 
2004; Katz et al., 2006). Although any given microarray is designed on most up- 
to-date sequence information available, rapidly evolving genomic sequence data 
can reveal inaccuracies in probe annotations when the body of knowledge grows. 
In recent studies, including Publication 1, a remarkable number of probes on vari- 
ous oligonucleotide arrays have been detected not to uniquely match their intended 
target (Hwang et al., 2004; Mecham et al., 2004a). A remarkable portion of probes 
on several popular microarray platforms in human and mouse did not match with 
their intended mRNA target, or were found to target unintended mRNA tran- 
scripts in the Entrez Nucleotide (Wheeler et al., 2005) sequence database in Publi- 
cation 1 (Table ??). The observations are in general concordant with other studies, 
although the exact figures vary according to the utilized database and compari- 
son details (Gautier et al., 2004; Mecham et al., 2004b). In this thesis, strategies 
are developed to improve microarray analysis with background information from 
genomic sequence databases, and with model-based analysis of microarray collec- 
tions. 

Probe verification is increasingly used in standard preprocessing, and to con- 
firm the results of a microarray study. Matching the probe sequences of a given 
array to updated genomic sequence databases and constructing an alternative in- 
terpretation of the array data based on the most up-to-date genomic annotations 

has been shown to increase the accuracy and cross-platform consistency of mi- 
croarray analyses in Publication 1 and elsewhere (Dai et al., 2005; Gautier et al., 
2004). 

Publication 1 combines probe verification with a novel probe-level preprocess- 
ing method, PECA, to suggest a novel framework for comparing and combining 
results across different microarray platforms. While huge repositories of microarray 
data are available, the data for any particular experimental condition is typically 
scarce, and coming from a number of different microarray platforms. Therefore 
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reliable approaches for integrating microarray data are valuable. Integration of 

results across platforms has proven problematic due to various sources of technical 
variation between array technologies. Matching of probe sequences between mi- 
croarray platforms has been shown to increase the consistency of microarray mea- 
surements (Hwang et al., 2004; Mecham ct al., 2004b). However, probe matching 
between array platforms guarantees only technical comparability (Irizarry et al., 
2005). Probe verification against external sequence databases is needed to con- 
firm that the probes are also biologically accurate. This can also improve the 
comparability across array platforms, as confirmed by the validation studies in 
Publication 1 (Figure ??A). 

The PECA method of Publication 1 utilizes genomic sequence databases to 
reduce probe-level noise by removing erroneous probes based on updated genomic 
knowledge. The strategy relies on external information in the databases and can 
therefore only remove known sources of probe-level contamination. Publication 2 
introduces a probabilistic framework to measure probe reliability directly based on 
microarray data collections. The analysis can reveal both well-characterized and 
unknown sources of probe-level contamination, and leads to improved estimates of 
gene expression. This model, coined Robust Probabilistic Averaging (RPA), also 
provides a theoretically justified framework for incorporating prior knowledge of 
the probes into the analysis. 

Array type Number of probes Verified probes (%) 



HG-U133 Plus2.0 604,258 58.2 

HG-U133A 247,965 82.5 

HG-U95Av2 199,084 82.6 

MOE430 2.0 496,468 68.2 

MG-U74Av2 197,993 73.1 



Table 4.1: The proportion of sequence- verified probes on three popular human microarray plat- 
forms and two mouse platforms, as observed in Publication 1. Probes that matched to mRNA 
sequences corresponding to unique genes (defined by a GenelD identifier) in the Entrez database 
axe considered verified. A remarkable portion of the probes on the investigated arrays did not 
match the Entrez transcript sequences, or had ambiguous targets. 



4.3 Mo del- based noise reduction 

Standard approaches for investigating probe performance typically rely on external 
information, such as genomic sequence data (see Mecham et al. 2004b; Zhang et al. 
2005 and Publication 1) or physical models (Naef and Magnasco, 2003; Wu et al., 
2005). However, such models cannot reveal probes with uncharacterized sources of 
contamination, such as cross-hybridization with alternatively spliced transcripts or 
closely related mRNA sequences. Vast collections of microarray data are available 
in public repositories. These large-scale data sets contain valuable information 
of both biological and technical aspects of gene expression studies. Publication 2 
introduces a data-driven strategy to extract and utilize probe-level information in 
microarray data collections. 
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Figure 4.2: A Effect of sequence verification on comparability between microarray platforms. 
Correlations between RMA-preprocessed technical replicates on two array platforms where the 
same samples have been hybridized on the two array types. The Pearson correlations were 
calculated for each pair of arrays measuring the same biological sample. The gray lines show 
correlations obtained with the different probe matching criteria. In the hESC array comparison, 
the best match probe sets contained exactly the same probes on both array generations, which 
resulted in very high correlations. The advantages of probe verification and alternative mappings 
were largest when arrays with different probe collections were compared in the mCPI, ALL and 
IM array comparisons. B Reproducibility of signal estimates in real data sets between the 
technical replicates, i.e., the 'best match' probe sets between the HG-U95Av2 and HG-U133A 
platforms. The consistency was measured by the Pearson correlation between the pairs of arrays, 
to which the same sample was hybridized. ©Published by Oxford University Press. Reprinted 
with permission from Publication 1. 



The model, Robust Probabilistic Averaging (RPA), is a probabilistic prepro- 
cessing procedure that is based on explicit modeling assumptions to analyze probe 
reliability and quantify the uncertainty in measurement data based on gene ex- 
pression data collections, independently of external information of the probes. The 
model can be viewed as a probabilistic extension of the probe-level preprocessing 
approach for differential gene expression studies presented in Publication 1. The 
explicit Bayesian formulation quantifies the uncertainty in the model parameters, 
and allows the incorporation of prior information concerning probe reliability into 
the analysis. RPA provides estimates of probe reliability, and a probeset-level 
estimate of differential gene expression directly from expression data and indepen- 
dently of the noise source. The RPA model is independent of physical models or 
external and constantly updated information such as genomic sequence data, but 
provides a framework for incorporating such prior information of the probes in 
gene expression analysis. 

Other probabilistic methods for microarray preprocessing include BGX (Hein 
et al., 2005), gMOS (Milo et al., 2003) and its extensions (Liu et al., 2005). The 
key difference to the RPA procedure of Publication 2 is that these methods are 
designed to provide probeset-level summaries of absolute gene expression levels, 
and suffer from the same unidentifiability problem of probe affinity parameters 
as the RMA algorithm (Irizarry et al., 2003a). In contrast, RPA models probe- 
level estimates of differential gene expression. This removes the unidentifiability 
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issue, which is advantageous when the objective is to compare gene expression 

levels between experimental conditions. Another important difference is that the 
other preprocessing methods do not provide explicit estimates of probe-specific 
parameters, or tools to investigate probe performance. Publication 2 assigns an 
explicit probabilistic measure of reliability to each probe. This gives tools to 
analyze probe performance and to guide probe design. 

Robust Probabilistic Averaging 

Let us now consider in more detail the probabilistic preprocessing framework, 
RPA, introduced in Publication 2. Probe performance is ultimately determined 
by its ability to accurately measure the expression level of the target transcript, 
which is unknown in practical situations. Although the performance of individual 
probes varies, the collection of probes designed to measure the same transcript 
will provide ground truth for assessing probe performance (Figure ??A). RPA 
captures the shared signal of the probes within a probcsct, and assumes that 
the shared signal characterizes the expression of the common target transcript 
of the probes. The reliability of individual probes is estimated with respect to 
the strongest shared signal of the probes. RPA assumes normally distributed 
probe effects, and quantifies probe reliability based on probe variance around the 
probeset-level signal across a large number of arrays. This extends the formulation 
of the RMA model in Equation 4.1 by introducing an additional probe-specific 
Gaussian noise component: 

Sij=gi + fJ'j+£ij- (4.2) 

In contrast to RMA, the variance is probe-specific in this model, and distributed 
as Sij ~ N{0,Tj). The variance parameters {t|} are of interest in probe reli- 
ability analysis; they reflect the noise Icvc;! of the probe, in contrast to probe- 
level preprocessing methods that focus on estimating the unidentifiable mean pa- 
rameter of the Gaussian noise model, corresponding to probe affinity (see e.g. 
Irizarry et al., 2003a; Li and Wong, 2001). In Publication 2, probe-level cal- 
culation of differential expression avoids the need to model unidentifiable probe 
affinities, the key probe-specific parameter in other probe-level preprocessing meth- 
ods. More formally, the unidentifiable probe aifinity parameters fi, cancel out in 
RPA when the signal log-ratio between a user-specified 'reference' array and the 
remaining arrays is computed for each probe: the differential expression signal 
between arrays t = {1, . . . ,T} and the reference array c for probe j is obtained 
by rutj = Stj - Scj = gt - 9c + Stj - Scj = dt + £tj - £cj- In vector notation, 
the differential expression profile of probe j across the T arrays is then written as 
mj = d-t-Ej. i.e., a noisy observation of the true underlying differential expression 
signal d and probe-specific noise Sj. 

The unidentifiable probe affinity parameters cancel out in the RPA model of 
Publication 2. This can partly explain the previous empirical observations that 
calculating differential expression already at probe-level improves the analysis of 
differential gene expression (Zhang et al., 2002; Elo et al., 2005). However, the 
previous models are non-probabilistic preprocessing methods that do not aim at 
quantifying the uncertainty in the probes. Use of a single parameter for probe 
effects in RPA also gives more straightforward interpretations of probe reliability. 

Posterior estimates of the model parameters are derived to estimate probe 
reliability and differential gene expression. The differential expression vector d = 
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{dt} and the probe-specific variances = {r?} are estimated simultaneously. The 

posterior density of the model parameters is obtained from the likelihood of the 
data and the prior according to Bayes' rule (Equation 3.3) as 

p{d, T^lm) - p(m|d, T^)p{d, T^). (4.3) 

To obtain this posterior, let us consider the likelihood p(m|d,T^) of the data and 
the prior p(d, r^) of the model parameters. The noise on the selected control array 
Ecj is a latent variable, and marginalized out in the model to obtain the likelihood: 

p(m|d,T^)=JJ j N{mtj\dt - ecj,Tf)N{scj\0,T])decj 



tj 



(4.4) 



Let us assume independent priors, p(d, r^) = p(d)p(r^), flat non-informative prior 
p(d) ^ 1 and conjiigatc priors for the variance parameters in (inverse Gamma 
function, see Gelman et al. 2003). With these standard assumptions, the prior 
takes the form 

p(d,T2)~n/G(Tj;a„/3,), (4.5) 

3 

where aj and (ij are the shape and scale parameters of the inverse Gamma dis- 
tribution. Prior information of the probes can be incorporated in the analysis 
through these parameters. Probe- level differential expression is then described 
by two sets of parameters; the differential gene expression vector d = [di . . . d-r], 
and the probe-specific variances = [r^ . . . tJ] . High variance r? indicates that 
the probe-level observation iiij is strongly deviated from the estimated true signal 
d. Denoting oij = + ^ and 0j = + \ T,ti'^tj - dtf - |^Si(^£izM!^ the 
posterior of the model parameters in Equation 4.3 takes the form 

4- 



p{d, T^lm) ~ J](r2)-(«.+i)ea;p(-^). (4.6) 



3 

The formulation allows estimating the uncertainty in the expression estimates and 
probe-level parameters. In practice, a MAP point estimate of the parameters, 
obtained by maximizing the posterior, is often sufRcient. In the limit of a large 
sample size (T — >■ oo), the model will converge to estimating ordinary mean and 
variance parameters. With limited sample sizes that are typical in microarray 
studies the prior parameters provide regularization that makes the probabilistic 
formulation more robust to overfitting and local optima, compared to direct esti- 
mation of the mean and variance parameters. Moreover, the probabilistic analysis 
takes the uncertainty in the data and model parameters into account in an explicit 
manner. 

The model also provides a principled framework for incorporating prior know- 
ledge probe reliability in microarray preprocessing through the probe-specific hy- 
perparameters a, /3. Estimation and use of probe-specific effects from external 
microarray data collections has been previously suggested in the context of the 
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refRMA method by Katz et al. (2006), where such side information was shown to 
improve gene expression estimates. The RPA method of Pubhcation 2 provides 
an alternative probabilistic treatment. 

Model validation 

The probabilistic RPA model introduced in Publication 2 was validated by c;ompar- 
ing the preprocessing performance to other preprocessing methods, and addition- 
ally by comparing the estimates of probe-level noise to known sources of probe-level 
contamination. The comparison methods include the FARMS (Hochreiter, 2006), 
MASS (Hubbell et al., 2002), PECA (Publication 1), and RMA (Irizarry et al., 
2003a) preprocessing algorithms. FARMS has a more detailed model for probe 
effects than the other methods, and it contains implicitly a similar probe-specific 
variance parameter than our RPA model. FARMS is based on a factor analysis 
model, and is defined as Sij = ZiXj + jij + Sij, where Zi captures the underlying 
gene expression. In contrast to RMA and RPA that have a single probe-specific 
parameter, FARMS has three probe-specific parameters {Xj, HjjSij}. MAS5 is a 
standard preprocessing algorithm provided by the array manufacturer. The algo- 
rithm performs local background correction, utilizes so-called mismatch probes to 
control for non-specific hybridization, and scales the data from each array to the 
same average intensity level to improve comparability across arrays. MASS sum- 
marizes probc-lcvcl observations of absohitc gene expression levels using robust 
summary statistics, Tukey biweight estimate, but unlike FARMS, RMA and RPA, 
MASS does not model probe-specific effects. 

The preprocessing performance of these methods was investigated in spike-in 
experiments where certain target transcripts measured by the array have been 
spiked in at known concentrations, as well as on real data sets. The results from 
the spike-in experiments were compared in terms of receiver operating characteris- 
tics (ROC). The standard RMA, PECA (Publication 1) and RPA (Pubhcation 2) 
had comparable performance in spike-in data, and they outperformed the MASS 
(Hubbell et al., 2002) and FARMS (Hochreiter, 2006) preprocessing algorithms in 
estimating differential gene expression. On real data sets, PECA and RPA out- 
performed the other methods, providing higher reproducibility between technical 
replicates measured on difi^erent microarray platforms (Figure ??B). 

In contrast to standard preprocessing algorithms, RPA provides explicit quan- 
titative estimates of probe performance. The model has been validated on widely 
used human whole-genome arrays by comparing the estimates of probe reliability 
with known probe-level error sources: errors in probe- genome alignment, interro- 
gation position of a probe on the target sequence, GC-content, and the presence 
of SNPs in the probe target sequences; a good model for assessing probe reliabil- 
ity should detect probes contaminated by the known error sources. The results 
from our analysis can be used to characterize the relative contribution of differ- 
ent sources of probe- level noise (Figure ??B). In general, the probes with known 
sources of contamination were more noisy than the other probes, with 7-39% in- 
crease in the average variance, as detected by RPA. Any single source of error 
seems to explain only a fraction of the most highly contaminated probes. A large 
portion (3S-60%) of the detected least reliable probes were not associated with the 
investigated known noise sources. This suggests that previous methods that re- 
move probe-level noise based on external information, such as genomic alignments 
will fail to detect a significant portion of poorly performing probes. The RPA 
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model of Publication 2 provides rigorous algorithmic tools to investigate the vari- 
ous probe-level error sources. Better understanding of the factors affecting probe 
performance can advance probe design and contribute to reducing probe-related 
noise in future generations of gene expression arrays. 

4.4 Conclusion 

The contributions presented in this Chapter provide improved preprocessing strate- 
gies for differential gene expression studies. The introduced techniques utilize 
probe- level analysis, as well as side information in sequence and microarray data- 
bases. Probe-level studies have led to the establishment of probe verification and 
alternative microarray interpretations as a standard step in microarray prepro- 
cessing and analysis. The alternative interpretations for microarray data based on 
updated genomic sequence data (Gautier et al., 2004; Dai et al., 2005) are now im- 
plemented as routine tools in popular preprocessing algorithms such as the RMA, 
or the RPA method of Publication 2. The probe-level analysis strategy has been 
recently extended to exon array context, where expression levels of alternative 
splice variants of the same genes arc compared under particular experimental con- 
ditions. The probe-level approach has shown superior preprocessing performance 
also with exon arrays (Laajala et al., 2009). A convenient access to the algo- 
rithmic tools developed in Publications 1 and 2 for microarray preprocessing and 
probe-level analysis is provided by the accompanied open source implementation 
in BioConductor.^ 



http://www.bioconductor.org/packages/release/bioc/html/RPA. html 
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Global analysis of the human 
transcriptome 

When we try to pick out anything by itself, we find that it is bound fast 
by a thousand invisible cords that cannot be broken, to everything in the 
universe. 

J. Muir (1869) 

Measurements of transcriptional activity provide only a partial view to physio- 
logical processes, but their wide availability provides a unique resource for investi- 
gating gene activity at a genome- and organism-wide scale. Versatile and carefully 
controlled gene expression atlases have become available for normal human tissues, 
cancer as well as for other diseases (see, for instance, Kilpinen et al., 2008; Lukk 
et al., 2010; Roth et al., 2006; Su et al., 2004). These data sources contain valuable 
information about shared and unique mechanisms between disparate conditions, 
which is not available in smaller and more specific experiments (Lage et al., 2008; 
Scherf et al., 2000). While standard methods for gene expression analysis have 
focused on comparisons between particular conditions, versatile transcriptome at- 
lases allow for global organism- wide characterization of transcriptional activation 
patterns (Levine et al., 2006). Novel methodological approaches are needed in 
order to realize the full potential of these information sources, as many tradi- 
tional methods for expression analysis are not applicable to versatile large-scale 
collections. This chapter provides an overview to current approaches for global 
transcriptome analysis in Section 5.1 and introduces the second main contribution 
of the thesis, a novel exploratory approach that can be used to investigate context- 
specific responses in genome-scale interaction networks across organism-wide col- 
lections of measurement data in Section 5.2. The conclusions are summarized in 
Section 5.3. 

5.1 Standard approaches 

Global observations of transcriptional activity reflect known and previously un- 
characterized cell-biological processes. Exploratory analysis of the transcriptome 
can provide research hypotheses and material for more detailed investigations. 
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Widely-used standard approaches for global transcriptome analysis include various 

clustering, dimensionality reduction and visualization techniques (sec e.g. Hutten- 
hower and Hofmann, 2010; Polanski and Kimmel, 2007; Quackenbush, 2001). The 
large data collections open up new possibilities to investigate functional related- 
ness between physiological conditions, disease states, as well as cellular processes, 
and to discover previously uncharacterized connections and functional mechanisms 
(Bergmann et al., 2004; Kilpinen et al., 2008; Lukk et al., 2010). 

Gene expression studies have traditionally focused on the analysis of relatively 
small and targeted data sets, such as particular diseases or cell types. A typical ob- 
jective is to detect genes, or gene groups, that are differentially expressed between 
particular conditions, for instance to predict disease outcomes, or to identify poten- 
tially unknown disease subtypes. The increasing availability of large and versatile 
transcriptome collections that may cover thousands of experimental conditions al- 
lows global, data-driven analysis, and the formulation of novel research questions 
where the traditional analysis methods are often insufficient (Huttenhower and 
Hofmann, 2010). 

A variety of approaches have been proposed and investigated in the recent 
years in the global transcriptome analysis context. An actively studied modeling 
problem in transcriptome analysis is the discovery of transcriptional modules, i.e., 
identification of coherent gene groups that show coordinated transcriptional re- 
sponses under particular conditions (Segal et al., 2003a, 2004; Stuart et al., 2003). 
Models have also been proposed to predict gene regulators (Segal et al., 2003b), 
and to infer cellular processes and networks based on transcriptional activation 
patterns (Friedman, 2004; Segal et al., 2003c). An increasing number of models 
are being developed to integrate transcriptome measurements to other sources of 
genomic information, such as regulation and interactions between the genes to 
detect and characterize cellular processes and disease mechanisms (Barash and 
Friedman, 2002; Chari et al., 2010; Vaske et al., 2010). Findings from transcrip- 
tome analysis have potential biomedical implications, as in Lamb ct al. (2006), 
where chemically perturbed cancer cell lines were screened to enhance the detec- 
tion of drug targets based on shared functional mechanisms between disparate 
conditions, or in Sorlie et al. (2001), where cluster analysis of cancer patients 
based on genome-wide transcriptional profiling experiments led to the discovery 
of a novel breast cancer subtype. In the remainder of this section, the modeling 
approaches that are particularly closely related to the contributions of this thesis 
are considered in more detail. 

Investigating known processes 

A popular strategy for genome-wide gene expression analysis is to consider known 
biological processes and their activation patterns across diverse collections mea- 
surement data from various experimental conditions. Biomedical databases con- 
tain a variety of information concerning genes and their interactions. For in- 
stance, the Gene Ontology database (Ashburner et al., 2000) provides functional 
and molecular classifications for the genes in human and a number of other organ- 
isms. Other categories are based on micro- RNA regulation, chromosomal locations, 
chemical perturbations and other features (Subramanian et al., 2005). Joint anal- 
ysis of functionally related genes can increase the statistical power of the analysis. 
So-called gene set-based approaches are typically designed to test differential ex- 
pression between two particular conditions (Goeman and Buhlmann, 2007; Nam 
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and Kim, 2008), but they can also be used to build global maps of transcriptional 

activity of the known processes (Levine et al., 2006). However, gene set-based 
approaches typically ignore more detailed information of the interactions between 
individual genes. Pathway and interaction databases contain more detailed infor- 
mation concerning molecular interactions and cell-biological processes (Kanehisa 
et al., 2008; Vastrik et al., 2007). Network-based methods utilize relational infor- 
mation of the genes to guide expression analysis. For instance, Draghici et al. 
(2007) demonstrated that taking into account aspects of pathway topology, such 
as gene and interaction types, can improve the estimation of pathway activity be- 
tween two predefined conditions. Another recent approach which utilizes pathway 
topology in inferring pathway activity is PARADIGM (Vaske et al., 2010), which 
also integrates other sources of genomic information in pathway analysis. How- 
ever, these methods have been designed for the analysis of particular experimental 
conditions, rather than comprehensive expression atlases. MATISSE (Ulitsky and 
Shamir, 2007) is a network-based approach that searches for functionally related 
genes that are connected in the network, and have correlated expression profiles 
across many conditions. The potential shortcoming of this approach is that it as- 
sumes global correlation across all conditions between the interacting genes, while 
many genes can have multiple, context-sensitive functional roles. Different condi- 
tions induce different responses in the same genes, and the definition of 'gene set' 
is vague (Montaner et al., 2009; Nacu et al., 2007). Therefore methods have been 
suggested to identify 'key condition-responsive genes' of predefined gene sets (Lee 
et al., 2008), or to decompose predefined pathways into smaller and more specific 
functional modules (Chang et al., 2009). These approaches rely on predefined 
functional classifications for the genes. The data-driven analysis in Publication 3 
provides a complementary approach where the gene sets are learned directly from 
the data, guided by prior knowledge of genetic interactions. This avoids the need 
to refine suboptimal annotations, and enables the discovery of new processes. The 
findings demonstrate that simply measuring whether a gene set, or a network, 
is differentially expressed between particular conditions is often not sufficient for 
measuring the activity of cell-biological processes. Since gene fmiction and inter- 
actions are regulated in a context-specific manner, it is important to additionally 
characterize how, and in which conditions the expression changes. Global analysis 
of transcriptional activation patterns interaction networks, introduced in Publica- 
tion 3, can address such questions. 

Biclustering and subspace clustering 

Approaches that are based on previously characterized genes and processes are 
biased towards well-characterized phenomena. This limits their value in de novo 
discovery of functional patterns. Unsupervised methods provide tools for such 
analysis, but often with an increased computational cost and a higher proportion 
of false positive findings. 

Cluster analysis is widely used for unsupervised analysis of gene expression 
data, providing tools for class discovery, gene function prediction and for visualiza- 
tion purposes. Examples of widely used clustering approaches include hierarchical 
clustering and K-means (see e.g. Polanski and Kimmel, 2007). Glustering of pa- 
tient samples with similar expression profiles has led to the discovery of novel can- 
cer subtypes with biomedical implications (S0rlie et al., 2001); clustering of genes 
with coordinated activation patterns can be used, for instance, to predict novel 



44 



5.1. STANDARD APPROACHES 



functional associations for poorly characterized genes (Allocco et al., 2004). The 

self-organizing map (Kohonen, 1982, 2001) is a related approach that provides effi- 
cient tools to visualize high-dimensional data on lower-dimensional displays, with 
particular applications in transcriptional profiling studies (Tamayo et al., 1999; 
Toronen et al., 1999). The standard clustering methods are based on comparison 
of global expression patterns, and therefore arc relatively coarse tools for analyz- 
ing large transcriptome collections. Different genes respond in different ways, as 
well as in different conditions. Therefore it is problematic to find clusters in high- 
dimensional data spaces, such as in whole-genome expression profiling studies; 
different gene groups can reveal different relationships between the samples. De- 
tection of smaller, coherent subspaces with a particular structure can be useful in 
biomedical applications, where the objective is to identify sets of interesting genes 
for further analysis. Both genes and the associated conditions may be unknown, 
and the learning task is to detect them from the data. This can help, for instance, 
in identifying responses to drug treatments in particular genes (Ihmels et al., 2002; 
Tanay et al., 2002), or in identifying fimctionally coherent transcriptional modules 
in gene expression databases (Segal et al., 2004; Tanay et al., 2005). 

Subspace clustering methods (Parsons et al., 2004) provide a family of algo- 
rithms that can be used to identify subsets of dependent features revealing coher- 
ent clustering for the samples; this defines a subspace in the original feature space. 
Subspace clustering models are a special case of a more general family of biclus- 
tering algorithms (Madeira and Oliveira, 2004). Closely related models are also 
called co-clustering (Cho et al., 2004), two-way clustering Gad et al. (2000), and 
plaid models (Lazzeroni and Owen, 2002). Biclustering methods provide general 
tools to detect co-regulated gene groups and associated conditions from the data, 
to provide compact summaries and to aid interpretation of transcriptome data 
collections. Biclustering models enable the discovery of gene expression signa- 
tures (Hu et al., 2006) that have emerged as a central concept in global expression 
analysis context. A signature describes a co-expression state of the genes, asso- 
ciated with particular conditions. Established signatures have been found to be 
reliable indicators of the physiological state of a cell, and commercial signatures 
have become available for routine clinical practice (Nuyten and van de Vijver, 
2008). However, the established signatures are typically designed to provide op- 
timal classification performance between two particular conditions. The problem 
with the classification-based signatures is that their associations to the underlying 
physiological processes are not well understood (Lucas et al., 2009). In Publica- 
tion 3 the understanding is enhanced by deriving transcriptional signatures that 
are explicitly connected to well-characterized processes through the network. 

Role of side information 

Standard clustering models ignore prior information of the data, which could be 
used to supervise the analysis, to connect the findings to known processes, as well 
as to improve scalability. For instance, standard model-based feature selection, or 
subspace clustering techniques would consider all potential connections between 
the genes or features (Law et al., 2004; Roth and Lange, 2004). Without addi- 
tional constraints on the solution space they can typically handle at most tens 
or hundreds of features, which is often insufficient in high-throughput genomics 
applications. Use of side information in clustering can help to guide unsupervised 
analysis, for instance based on known or potential interactions between the genes. 
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This has been shown to improve the detection of functionally coherent gene groups 
(Hanisch et al., 2002; Shiga et al., 2007; Ulitsky and Shamir, 2007; Zhu at al., 

2005) . However, while these methods provide tools to cluster the genes, they 
do not model differences between conditions. Extensions of biclustering models 
that can utilize relational information of the genes include cMonkey (Reiss et al., 

2006) and a modified version of SAMBA biclustering (Tanay et al., 2004). How- 
ever, cMonkey and SAMBA are application-oriented tools that rely on additional, 
organism-specific information, and their implementation is currently not available 
for most organisms, including that of the human. Further application-oriented 
models for utilizing side information in the discovery of transcriptional modules 
have recently been proposed for instance by Savage et al. (2010) and Suthram ct al. 
(2010). Publication 3 introduces a complementary method where the exhaustively 
large search space is limited with side information concerning known relations be- 
tween the genes, derived from genomic interaction databascis. This is a general 
algorithmic approach whose applicability is not limited to particular organisms. 

Other approaches 

Prior information on the cellular networks, regulatory mechanisms, and gene func- 
tion is often available, and can help to construct more detailed models of gene 
function and network analysis, as well as to summarize functional aspects of ge- 
nomic data collections (Huttenhower et al., 2009; Segal et al., 2003b; Troyanskaya, 
2005). Versatile transcriptome collections also enable network reconstruction, i.e., 
de novo discovery (Lezon et al., 2006; Myers et al., 2005) and augmentation (Novak 
and Jain, 2006) of genetic interaction networks. Other methodological approaches 
for global transcriptome analysis arc provided by probabilistic latent variable mod- 
els (Rogers et al., 2005; Segal et al., 2003a), hierarchical Dirichlet process algo- 
rithms (Gerber et al., 2007), as well as matrix and tensor computations (Alter and 
Golub, 2005). These methods provide further model-based tools to identify and 
characterize transcriptional programs by decomposing gene expression data sets 
into smaller, functionally coherent components. 

5.2 Global modeling of transcriptional activity in 
interaction networks 

Molecular interaction networks cover thousands of genes, proteins and small mo- 
lecules. Coordinated regulation of gene function through molecular interactions 
determines cell function, and is reflected in transcriptional activity of the genes. 
Since individual processes and their transcriptional responses are in general un- 
known (Lee et al., 2008; Montaner et al., 2009), data-driven detection of condition- 
specific responses can provide an efficient proxy for identifying distinct transcript- 
ional states of the network with potentially distinct functional roles. While a 
number of methods have been proposed to compare network activation patterns 
between particular conditions (Draghici et al., 2007; Ideker et al., 2002; Cabusora 
et al., 2005; Noirel et al.. 2008), or to use network information to detect function- 
ally related gene groups (Segal et al., 2003d; Shiga et al., 2007; Ulitsky and Shamir, 

2007) , general-purpose algorithms for a global analysis of context-specific network 
activation patterns in a genome- and organism-wide scale have been missing. 
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A detected subnetwork with tissue-specific responses 




Figure 5.1: Organism- wide analysis of transcriptional responses in a human pathway interaction 
network reveals physiologically coherent activation patterns and condition-specific regulation. 
One of the subnetworks and its condition-specific responses, as detected by the NetResponse 
algorithm is shown in the Figure. The expression of each gene is visualized with respect to its 
mean level of expression across all samples. ©The Author 2010. Published by Oxford University 
Press. Reprinted with permission from Publication 3. 



Publication 3 introduces and validates two general-purpose algorithms that 
provide tools for global modeling of transcriptional responses in interaction net- 
works. The motivation is similar to biclustering approaches that detect function- 
ally coherent gene groups that show coordinated response in a subset of condi- 
tions (Madeira and Oliveira, 2004). The network ties the findings more tightly to 
cell-biological processes, focusing the analysis and improving interpretability. In 
contrast to previous network-based biclustering models for global transcriptome 
analysis, such as cMonkey (Reiss et al., 2006) or SAMBA (Tanay et al., 2004), 
the algorithms introduced in Publication 3 are general-purpose tools, and do not 
depend on organism-specific annotations. 



A two-step approach 

The first approach in Publication 3 is a straightforward extension of network-based 
gene clustering methods. In this two-step approach, the functionally coherent sub- 
networks, and their condition-specific responses are detected in separate steps. In 
the first step, a network-based clustering method is used to detect functionally co- 
herent subnetworks. In Publication 3, MATISSE, a state-of-the-art algorithm de- 
scribed in Ulitsky and Shamir (2007), is used to detect the subnetworks. MATISSE 
finds connected subgraphs in the network that have high internal correlations be- 
tween the genes. In the second step, condition-specific responses of each identified 
subnetwork are searched for by a nonparametric Gaussian mixture model, which 
allows a data-driven detection of the responses. However, the two-step approach, 
coined MATISSE-I-, can be suboptimal for detecting subnetworks with particular 
condition-specific responses. The main contribution of Publication 3 is to intro- 
duce a second general-purpose algorithm, coined NetResponse, where the detection 
of condition-specific responses is used as the explicit key criterion for subnetwork 
search. 
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The NetResponse algorithm 

The network-based search procedure introduced in Publication 3 searches for lo- 
cal subnetworks, i.e., functionally coherent network modules where the interacting 
genes show coordinated responses in a subset of conditions (Figure 5.1). Side in- 
formation of the gene interactions is used to guide modeling, but the algorithm 
is independent of predefined classifications for genes or measurement conditions. 
Transcriptional responses of the network are described in terms of subnetwork 
activation. Regulation of the subnetwork genes can involve simultaneous activa- 
tion and repression of the genes: sufficient amounts of mRNA for key proteins 
has to be available while interfering genes may need to be silenced. The model 
assumes that a given subnetwork n can have multiple transcriptional states, as- 
sociated with different physiological contexts. A transcriptional state is reflected 
in a unique expression signature s*^"\ a vector that describes the expression levels 
of the subnetwork genes, associated with the particular transcriptional state. Ex- 
pression of some genes is regulated at precise levels, whereas other genes fluctuate 
more freely. Given the state, expression of the subnetwork genes is modeled as a 
noisy observation of the transcriptional state. With a Gaussian noise model with 
covariancc the observation is described by x*^"^ ^ N{s'^"-\ E*^"^). A given sub- 
network can have i?^"^ latent transcriptional states indexed by r. In practice, the 
states, including their number are unknown, and they have to be estimated 
from the data. In a specific measurement condition, the subnetwork n can be in 
any one of the latent physiological states indexed by r. Associations between the 
observations and the underlying transcriptional states are unknown and they are 
treated as latent variables. Gene expression in subnetwork n is then modeled with 
a Gaussian mixture model: 



where each component distribution p is assumed to be Gaussian with parameters 
6r = {sr"\ S["^}. In practice, we assume a diagonal covariance matrix S^"-*, leav- 
ing the dependencies between the genes unmodeled within each transcriptional 
state. Use of diagonal covariances is justified by considerable gains in computa- 
tional efficiency when the detection of distinct responses is of primary interest. It 
is possible, however, that such simplified model will fail to detect certain subnet- 
works where the transcriptional levels of the genes have strong linear dependencies 
within the individual transcriptional states; signaling cascades could be expected 
to manifest such activation patterns, for instance. More detailed models of tran- 
scriptional activity could help to distinguish the individual states in particular 
when the transcriptional states are partially overlapping, but with increased com- 
putational cost. A particular transcriptional response is then characterized with 
the triple {sr"\ wf"^}. This defines the shape, fluctuations and frequency of 
the associated transcriptional state of subnetwork n. A posterior probability of 
each latent state can be calculated for each measurement sample from the Bayes' 
rule (Equation 3.3). The posterior probabilities can be interpreted as soft compo- 
nent memberships for the samples. A hard, deterministic assignment is obtained 
by selecting for each sample the component with the highest posterior probability. 

The remaining task is to identify the subnetworks having such distinct tran- 
scriptional states. Detection of the distinct states is now used as a search criterion 




(5.1) 



r=l 
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for the subnetworks. In order to achieve fast computation, an agglomerative proce- 
dure is used where interacting genes are gradually merged into larger subnetworks. 
Initially, each gene is assigned in its own singleton subnetwork. Agglomeration 
proceeds by at each step merging the two neighboring subnetworks where joint 

modeling of the genes leads to the highest improvement in the objective function 
value. Joint modeling of dependent genes reveals coordinated responses and im- 
proves the likelihood of the data in comparison with independent models, giving 
the first criterion for merging the subnetworks. However, increasing subnetwork 
size tends to increase model complexity and the possibility of overfitting, since the 
number of samples remains constant while the dimensionality (subnetwork size) 
increases. To compensate for this effect, the Bayesian information criterion (see 
Gclman ct al., 2003) is used to penalize increasing model complexity and to de- 
termine optimal subnetwork size. The final cost function for a subnetwork G is 
C(G) = —2C + qlog(N). where C is the (marginal) log-likelihood of the data, given 
the mixture model in Equation 5.1, q is the number of parameters and N denotes 
sample size. The algorithm then compares independent and joint models for each 
subnetwork pair that has a direct link in the network, and merges at each step the 
subnetwork minimizes the cost 

AC = -2(A,, - (A + Cj)) + (g,,, - (<z, + q,))log{N). (5.2) 

The iteration continues until no improvement is obtained by merging the sub- 
networks. The combination of modeling techniques yields a scalable algorithm for 
genome- and organism-wide investigations: First, the analysis focuses on those 
parts of the data that are supported by known interactions, which increases mod- 
eling power and considerably limits the search space. Second, the agglomerative 
scheme finds a fast approximative solution where at each step the subnetwork pair 
that leads to the highest improvement in cost function is merged. Third, an ef- 
ficient variational approximation is used to learn the mixture models (Kurihara 
et al., 2007b). Note that the algorithm does not necessarily identify a globally 
optimal solution. However, detection of physiologically coherent and reproducible 
responses is often sufficient for practical applications. 

Global view on netw^ork activation patterns 

The NetResponse algorithm introduced in Publication 3 was applied to investigate 
transcriptional activation patterns of a pathway interaction network of 1800 genes 
based on the KEGG database of metabolic pathways (Kanehisa et al., 2008) pro- 
vided by the SPIA package (Tarca et al., 2009) across 353 gene expression samples 
from 65 tissues. The two algorithms proposed in Publication 3, MATISSE-I- and 
NetResponse were shown to outperform an unsupervised biclustcring approach in 
terms of reproducibility of the finding. The introduced NetReponse algorithm, 
where the detection of transcriptional response patterns is used as a search crite- 
rion for subnetwork identification, was the best-performing method. The algorithm 
identified 106 subnetworks with 3-20 genes, with distinct transcriptional responses 
across the conditions. One of the subnetworks is illustrated in Figure 5.1; the 
other findings are provided in the supplementary material of Publication 3. The 
detected transcriptional responses were physiologically coherent, suggesting a po- 
tential functional role. The reproducibility of the responses was confirmed in an 
independent validation data set, where 80% of the predicted responses were de- 
tected (p < 0.05). The findings highlight context-specific regulation of the genes. 
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Some responses are shared by many conditions, while others are more specific 

to particular contexts such as the immune system, muscles, or the brain; related 
physiological conditions often exhibit similar network activation patterns. Tissue 
relatedness can be measured in terms of shared transcriptional responses of the 
subnetworks, giving an alternative formulation of the tissue connectome map sug- 
gested by Greco ct al. (2008) in order to highlight functional connectivity between 
tissues based on the number of shared differentially expressed genes. In Publica- 
tion 3, shared network responses are used instead of shared gene count. The use 
of co-regulated gene groups is expected to be more robust to noise than the use of 
individual genes. The analysis provides a global view on network activation across 
the normal human body, and can be used to formulate novel hypotheses of gene 
function in previously unexplored contexts. 

5.3 Conclusion 

Gene function and interactions are often subject to condition-specific regulation 
(Liang et al., 2006; Rachlin et al., 2006), but these have been typically studied 
only in particular experimental conditions. Organism-wide analysis can poten- 
tially reveal new functional connections and help to formulate novel hypotheses of 
gene function in previously unexplored contexts, and to detect highly specialized 
functions that are specific to few conditions. Changes in cell-biological condi- 
tions induce changes in the expression levels of co-regulated genes, in order to 
produce specific physiological responses, typically affecting only a small part of 
the network. Since individual processes and their transcriptional responses are in 
general unknown (Lee et al., 2008; Montaner et al., 2009), data-driven detection of 
condition-specific responses can provide an efficient proxy for identifying distinct 
transcriptional states of the network, with potentially distinct functional roles. 

Publication 3 provides efficient model-based tools for global, organism- wide dis- 
covery and characterization of context-specific transcriptional activity in genome- 
scale interaction networks, independently of predefined classifications for genes 
and conditions. The network is used to bring in prior information of gene func- 
tion, which would be missing in unsupervised models, and allows data-driven de- 
tection of coordinately regulated gene sets and their context-specific responses. 
The algorithm is readily applicable in any organism where gene expression and 
pairwise interaction data, including pathways, protein interactions and regulatory 
networks, arc available. It has therefore a considerably larger scope than previous 
network-based models for global transcriptome analysis, which rely on organism- 
specific annotations, but lack implementations for most organisms (Reiss et al., 
2006; Tanay et al., 2004). 

While biomedical implications of the findings require further investigation, the 
results highlight shared and reproducible responses between physiological condi- 
tions, and provide a global view of transcriptional activation patterns across the 
normal human body. Other potential applications for the method include large- 
scale screening of drug responses and disease subtype discovery. Implementation 
of the algorithm is freely available through BioGonductor.^ 



http://bioconductor.org/packages/devel/bioc/html/netresponse. html 



50 



Chapter 6 



Human transcriptome and 
other layers of genomic 
information 



The way to deal with the problem of big data is to beat it senseless with 
other big data. 

J. Quackenbush (2006) 

This chapter prcscints the third main contribution of the thesis, computational 
strategies to integrate measurements of human transcriptome to other layers of 
genomic information. Genomic, transcriptomic, proteomic, epigenomic and other 
sources of measurement data characterize different aspects of genome organiza- 
tion (Hawkins et al., 2010; Montaner and Dopazo, 2010; Sara et al., 2010); any 
single source provides only a limited view to the cellular system. Understanding 
functional organization of the genome and ultimately the cell function requires 
integration of data from the various levels of genome organization and model- 
ing of their dynamical interplay. Such an holistic approach, which is also called 
systems biology, is a key to understanding living organisms, which are "rich in 
emergent properties because forever new groups of properties emerge at every 
level of integration" (Mayr, 2004). Combining evidence across multiple sources 
can help to discover fimctional mechanisms and interactions, which are not seen 
in the individual data sets, and to increase statistical power in noisy and incom- 
plete high-throughput experiments (Huttenhower and Hofmann, 2010; Reed et al., 
2006). 

Integration of heterogeneous genomic data comes with a variety of technical 
and methodological challenges (Hwang et al., 2005; Troyanskaya, 2005), and the 
particular modeling approaches vary according to the analysis task and particular 
properties of the investigated measurement sources. Integrative studies have been 
limited by poor availability of co-occurring genomic observations, but suitable 
data sets are now becoming increasingly available in both in-house and public 
biomedical data repositories (The Cancer Genome Atlas Research Network, 2008). 
New observations highlight the need for novel, integrative approaches in functional 
genomics (Coe et al., 2008). Recent studies have proposed for instance methods to 
integrate epigenetic modifications (Sadikovic et al., 2008), micro- RNA (Qin, 2008), 
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transcription factor binding (Savage et al., 2010), as well as protein expression 

(Johnson ct al., 2008). Given the complex stochastic nature of biological systems, 
computational efficiency, robustness against uncertainty and interpretability of the 
results are key issues. Prior information of biological systems is often incomplete, 
and subject to high levels of uncontrolled variation and complex interdependencies 
between different parts of the cellular system (Troyanskaya, 2005). These issues 
emphasize the need for principled approaches requiring minimal prior knowledge 
about the data, as well as minimal model fitting procedures. Section 6.1 gives an 
overview of the standard models for high-throughput data integration methods, 
which have close connections to the modeling approaches developed in this work. 

6.1 Standard approaches for genomic data inte- 
gration 

The integrative approaches can be roughly classified in three categories: meth- 
ods that (i) combine statistical evidence across related studies in order to obtain 
more accurate inferences of target variables, (ii) utilize side information in order 
to guide the analysis of a single, primary data source, and (iii) detect and char- 
acterize dependencies between the measurement sources in order to discover new 
functional connections between the different layers of genomic information. The 
contributions in Chapters 4 and 5 are associated with the first two categories; the 
contributions presented in this chapter, the regularized dependency detection fra- 
mework of Publication 4, and associative clustering of Publications 5 and 6, belong 
to the third category. 

6.1.1 Combining statistical evidence 

The first general category of methods for genomic data integration consists of ap- 
proaches where evidence across similar studies is combined to increase statistical 
power, for instance by comparing and integrating data from independent microar- 
ray experiments targeted at studying the same disease. In Publications 2 and 3, 
joint analysis of a large number of commensurable microarray experiments, where 
the observed data is directly comparable between the arrays, helps to increase 
statistical power and to reveal weak, shared signals in the data that can not be 
detected in more restricted experimental setups and smaller datasets. 

However, the related observations are often not directly comparable, and fur- 
ther methodological tools arc needed for integration. Meta-analysis provides tools 
for such analysis (Ramasamy et al., 2008). Meta-analysis forms part of the microar- 
ray analysis procedure introduced in Publication 1, where methods to integrate 
related microarray measurements across different array platforms are developed. 
Meta-analysis emphasizes shared effects between the studies over statistical sig- 
nificance in individual experiments. In its standard form, meta-analysis assumes 
that each individual study measures the same target variable with varying lev- 
els of noise. The analysis starts from identifying a measure of effect size based 
on differences, means, or other summary statistics of the observations such as 
the Hedges' g, used in Publication 1. Weighted averaging of the effect sizes pro- 
vides the final, combined result. Weighting accounts for differences in reliability 
of the individual studies, for instance by emphasizing studies with large sample 
size, or low measurement variance. Averaging is expected to yield more accurate 
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estimates of the target variable than individual studies. This can be particularly 

useful when several studies with small sample sizes are available for instance from 
different laboratories, which is a common setting in microarray analysis context, 
where the data sets produced by individual laboratories are routinely deposited 
to shared community databases. Ultimately, the quality of meta-analysis results 
rests on the quality of the individual studies. Modeling choices, such as the choice 
of the effect size measure and included studies will affect the analysis outcome. 

Kernel methods (see e.g. Scholkopf and Smola, 2002) provide another widely 
used approach for integrating statistical evidence across multiple, potentially het- 
erogeneous measurement sources. Kernel methods operate on similarity matrices, 
and provide a natural framework for combining statistical evidence to detect sim- 
ilarity and patterns that are supported by multiple observations. The modeling 
framework also allows for efficient modeling of nonlinear feature spaces. 

Multi-task learning refers to a class of approaches where multiple, related mod- 
eling tasks are solved simultaneously by combining statistical power across the 
related tasks. A typical task is to improve the accuracy of individual classifiers by 
taking advantage of the potential dependencies between them (see e.g. Caruana, 
1997). 

6.1.2 Role of side information 

The second category of approaches for genomic data integration consists of meth- 
ods that are asymmetric by nature; integration is used to support the analysis 
of one, primary data source. Side information can be used, for instance, to limit 
the search space and to focus the analysis to avoid overfitting, speed up compu- 
tation, as well as to obtain potentially more sensitive and accurate findings (see 
e.g. Eisenstein, 2006). One strategy is to impose hard constraints on the model, or 
model family, based on side information to target specific research questions. In 
gene expression context, functional classifications or known interactions between 
the genes can be used to constrain the analysis (Goeman and Buhlmann, 2007; 
Ulitsky and Shamir, 2009). In factor analysis and mixed effect models, clinical an- 
notations of the samples help to focus the modeling on particular conditions (see 
e.g. Carvalho et al., 2008). Hard constraints rely heavily on the accuracy of side 
information. Soft, or probabilistic approaches can take the uncertainty in side in- 
formation into account, but they are computationally more demanding. Examples 
of such methods in the context of transcriptome analysis include for instance the 
supervised biclustering models, such as cMonkey and modified SAMBA, as well 
as other methods that guide the analysis with additional information of genes and 
regulatory mechanisms, such as transcription factor binding (Rciss et al., 2006; 
Savage et al., 2010; Tanay et al., 2004). Pubhcation 3 uses gene interaction net- 
work as a hard constraint for modeling transcriptional co-regulation of the genes, 
but the condition-specific responses of the detected gene groups are identified in 
an unsupervised manner. 

A complementary approach for utilizing side information of the experiments 
is provided by multi-way learning. A classical example is the analysis of variance 
(ANOVA), where a single data set is modeled by decomposing it into a set of basic, 
underlying effects, which characterize the data optimally. The effects are associ- 
ated with multiple, potentially overlapping attributes of the measurement samples, 
such as disease state, gender and age, which are known prior to the analysis. Tak- 
ing such prior knowledge of systematic variation between the samples into account 
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helps to increase modeling power and can reveal the attribute-specific effects. An 

interesting subtask is to model the interactions between the attributes, so-called 
interaction effects. These are manifested only with particular combinations of 
attributes, and indicate dependency between the attributes. For instance, simul- 
taneous cigarette smoking and asbestos exposure will considerably increase the 
risk of lung cancer, compared to any of the two risk factors alone (sec e.g. Nymark 
et al., 2007). Factor analysis is a closely related approach where the attributes, 
also called factors, are not given but instead estimated from the data. Mixed 
effect models combine the supervised and unsupervised approaches by incorporat- 
ing both fixed and random effects in the model, corresponding to the known and 
latent attributes, respectively (sec e.g. Carvalho ct al., 2008). The standard factor- 
ization approaches for individual data sets are related to the dependency-seeking 
approaches in Publications 4-6, where co-occurring data sources are decomposed 
in an imsupcrviscd manner into components that are maximally informative of the 
components in the other data set. 

6.1.3 Modeling of mutual dependency 

Symmetric models for dependency detection form the third main category of meth- 
ods for genomic data integration, as well as the main topic of this chapter. De- 
pendency modeling is used to distinguish the shared signal from dataset- specific 
variation. The shared effects are informative of the commonalities and intcrac;tions 
between the observations, and are often the main focus of interest in integrative 
analysis. This motivates the development of methods that can allocate computa- 
tional resources efficiently to modeling of the shared features and interactions. 

Multi-view learning is a general category of approaches for symmetric depen- 
dency modeling tasks. In multi-view learning, multiple measurement sources are 
available, and each source is considered as a difFcrcnt view on the same objects. The 
task is to enhance modeling performance by combining the complementary views. 
A classical example of such a model is canonical correlation analysis (Hotelling, 
1936). Related approaches that have recently been applied in functional genomics 
include for instance probabilistic variants of meta-analysis (Choi ct al., 2007; Con- 
Ion et al., 2007), generalized singular value decomposition (see e.g. Alter et al., 
2003; Berger et al., 2006) and simultaneous non- negative matrix factorization 
(Badea, 2008). 

The dependency modeling approaches in this thesis make an explicit distinc- 
tion between statistical representation of data and the modeling task. Let us 
denote the representations of two co-occurring multivariate observations, x and y, 
with /a;(x) and fy{y), respectively. The selected representations depend on the 
application task. The representation can be for instance used to perform feature 
selection as in canonical correlation analysis (CCA) Hotelling (1936), capture non- 
linear features in the data as in kernelized versions of CCA (see e.g. Yamanishi 
et al., 2003), or partition the data as in information bottleneck (Friedman et al., 
2001) and associative clustering (Publications 5-6). Statistical independence of the 
representations implies that their joint probability density can be decomposed as 
p(/a;(x), fy{y)) = p{fx{^))p{fy{y))- Deviations from this assumption indicate sta- 
tistical dependency. The representations can have a flexible parametric form which 
can be optimized by the dependency modeling algorithms to identify dependency 
structure in the data. 

Recent examples of such dependency-maximizing methods include probabilistic 
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canonical correlation analysis (Bach and Jordan, 2005), which has close theoretical 

connection to the regularized models introduced in Publication 4, and the asso- 
ciative clustering principle introduced in Publications 5-6. Canonical correlations 
and contingency table analysis form the methodological background for the con- 
tributions in Publications 4-6. In the remainder of this section these two standard 
approaches for dependency detection are considered more closely. 

Classical and probabilistic canonical correlation analysis 

Canonical correlation analysis (CCA) is a classical method for detecting linear 
dependencies between two multivariate random variables (Hotelling, 1936). While 
ordinary correlation characterizes the association strength between two vectors 
with paired scalar observations, CCA assumes paired vectorial values, and gener- 
alizes correlation to multidimensional sources by searching for maximally correlat- 
ing low-dimensional representation of the two sources, defined by linear projections 
Xva;, Yvj|. Miiltiplc projection components can be obtained itcrativcly, by finding 
the most correlating projection first, and then consecutively the next ones after 
removing the dependencies explained by the previous CCA components; the lower- 
dimensional representations arc defined by projections to linear hyperplanes. The 
model can be formulated as a generalized eigenvalue problem that has an analytical 
solution with two useful properties: the result is invariant to linear transforma- 
tions of the data, and the solution for any fixed number of components maximizes 
mutual information between the projections for Gaussian data (KuUback, 1959; 
Bach and Jordan, 2002). Extensions of the classical CCA include generalizations 
to multiple data sources (Kcttcnring, 1971; Bach and Jordan, 2002), regularized 
solutions with non- negative and sparse projections (Sigg et al., 2007; Archambeau 
and Bach, 2008; Witten et al., 2009), and non-linear extensions, for instance with 
kernel methods (Bach and Jordan, 2002; Yamanishi et al., 2003). Direct opti- 
mization of correlations in the classical CCA provides an efficient way to detect 
dependencies between data sources, but it lacks an explicit model to deal with the 
uncertainty in the data and model parameters. 

Recently, the classical CCA was shown to correspond to the ML solution of a 
particular generative model where the two data sets are assumed to stem from a 
shared Gaussian latent variable z and normally distributed data-sct-spccific noise 
(Bach and Jordan, 2005). Using linear assumptions, the model is formally defined 
as 



The manifestation of the shared signal in each data set can be different. This is pa- 
rameterized by Wx and Wy . Assuming a standard Gaussian model for the shared 
latent variable, z ^ A/'(0, 1) and data set-specific effects where £x ~ A/'(0, '^x) 
(and respectively for y), the correlation-maximizing projections of the traditional 
CCA introduced in Section 6.1 can be retrieved from the ML solution of the model 
(Archambeau et al., 2006; Bach and Jordan, 2005). The model decomposes the 
observed co-occurring data sets into shared and data set-specific components based 
on explicit modeling assumptions (Figure 6.1). The dataset-specific effects can also 
be described in terms of latent variables as £x = ^x^x and £y = ^yZy, allowing 
the construction of more detailed models for the dataset-specific effects (Klami and 




(6.1) 



55 



CHAPTER 6. HUMAN TRANSCRIPTOME AND OTHER LAYERS OF GENOMIC 
INFORMATION 



Kaski, 2008). The shared signal z is treated as a latent variable and marginalized 
out in the model, providing the marginal likelihood for the observations: 

p(X, Y|W, *) = y p{X, Y|Z, W, *)p(Z)dZ, (6.2) 

where ^ denotes the block-diagonal matrix of ^'x, 'ify, and W = [Wa;;Wy]. The 
probabilistic formulation of CCA has opened up a way to new probabilistic ex- 
tensions that can treat the modeling assumptions and uncertainties in the data in 
a more explicit and robust manner (Archambeau et al., 2006; Klami and Kaski, 
2008; Klami et al., 2010). 

The general formulation provides a flexible modeling framework, where differ- 
ent modeling assumptions can be used to adapt the models in different applications. 
The connection to classical CCA assumes full covariances for the dataset-specific 
effects. Simpler models for the dataset-specific effects will not distinguish between 
the shared and marginal effects as effectively, but they have fewer model param- 
eters that can potentially reduce overlearning and speed up computation. It is 
also possible to tune the dimensionality of the shared latent signal. Learning of 
lower- dimensional models can be faster and potentially less prone to overfitting. 
Interpretation of simpler models is also more straightforward in many applications. 
The probabilistic formulation allows rigorous treatment of uncertainties in the data 
and model parameters also with small sample sizes that are common in biomedical 
studies, and allows the incorporation of prior information through Bayesian priors, 
as in the regularized dependency detection framework introduced in Publication 4. 




Figure 6.1: A graphical representation of the generative shared latent variable model in Equa- 
tion (6.1). The latent source z is shared by observations x and y. The other effects that are 
specific to each observation are characterized by Zx and Zy , respectively. Gray shading indicates 
observed variables. 



Contingency table analysis 

Contingency table analysis is a classical approach used to study associations be- 
tween co-occurring categorical observations. The co-occurrences are represented 
by cross-tabulating them on a contingency table, the rows and columns of which 
correspond to the first and second set of features, respectively. Various tests are 
available for measuring dependency between the rows and columns of the table 
Yates (1934); Agresti (1992), including the classical Fisher test (Fisher, 1934), a 
standard tool for measuring statistical enrichment of functional categories in gene 
cluster analysis (Hosack et al., 2003). While the classical contingency table anal- 
ysis is used to measure dependency between co-occurring variables, more recent 
approaches use contingency tables to derive objective functions for dependency ex- 
ploration tasks. The associative clustering principle introduced in Publications 5-6 
is an example of such approach. 



56 



6.2. REGULARIZED DEPENDENCY DETECTION 



Other approaches that use contingency tabic dependencies as objective func- 
tions include the information bottleneck (IB) principle (Tishby et al., 1999) and 
discriminative clustering (DC) (Sinkkonen et al., 2002; Kaski et al., 2005). These 
are asymmetric, dependency-seeking approaches that can be used to discover clus- 
ter structure in a primary data such that it is maximally informative of another, 
discrete auxiliary variable. The dependency is represented on a contingency ta- 
ble, and maximization of contingency table dependencies provides the objective 
function for clustering. While the standard IB operates on discrete data, DC is 
used to discover cluster structure in continuous- valued data. The two approaches 
also employ different objective functions. In classical IB, a discrete variable X is 
clustered in such a way that the cluster assignments become maximally informa- 
tive of another discrete variable y. The complexity of the cluster assignments is 
controlled by minimizing the mutual information between the cluster indices and 
the original variables. The task is to find a partitioning X that minimizes the 
cost C{p(X.\X.)) — /(X; X) — ,i9/(X; Y), where /3 controls clustering resolution. In 
DC, mutual information is replaced by a Bayes factor between the two hypothe- 
ses of dependent and independent margins. The Bayes factor is asymptotically 
consistent with mutual information, but provides an unbiased estimate for limited 
sample size (see e.g. Sinkkonen et al., 2005). The standard information bottleneck 
and discriminative clustering are asymmetric methods that treat one of the data 
sources as the primary target of analysis. 

In contrast, the dependency maximization approaches considered in this thesis, 
the associative clustering (AC) and regularized versions of canonical correlation 
analysis are symmetric and they operate exclusively on continuous-valued data. 
CCA is not based on contingency table analysis, but it has close connections to 
the Gaussian IB (Chechik et al.. 2005) that seeks maxinial dependency between 
two sets of normally distributed variables. The Gaussian IB retrieves the same 
subspacc as CCA for one of the data sets. However, in contrast to the symmetric 
CCA model, Gaiissian IB is a directed method that finds dependency-maximizing 
projections for only one of the two data sets. The second dependency detection 
approach considered in this thesis, the associative clustering, is particularly related 
to the symmetric IB that finds two sets of clusters, one for each variable, which 
are optimally compressed presentations of the original data, and at the same time 
maximally informative of each other (Friedman et al., 2001). While the objective 
function in IB is derived from mutual information, AC uses the Bayes factor as an 
objective function in a similar manner as it is used in the asymmetric discriminative 
clustering. Another key difference is that while the symmetric IB operates on 
discrete data, AC employs contingency table analysis in order to discover cluster 
structure in continuous-valued data spaces. 

6.2 Regularized dependency detection 

Standard unsupervised methods for dependency detection, such as the canonical 
correlation analysis or the symmetric information bottleneck, seek maximal depen- 
dency between two data sets with minimal assumptions about the dependencies. 
The unconstrained models involve high degrees of freedom when applied to high- 
dimensional genomic observations. Such flexibility can easily lead to overfltting, 
which is even worse for more flexible nonparametric or nonlinear, kernel-based de- 
pendency discovery methods. Several ways to regularize the solution have been 
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suggested to overcome associated problems, for instance by imposing sparsity con- 
straints on the solution space (Bie and Moor, 2003; Vinod, 1976). 

In many applications prior information of the dependencies is available, or 
particular types of dependency are relevant for the analysis task. Such prior infor- 
mation can be used to reduce the degrees of freedom in the model, and to regularize 
dependency detection. In the cancer gene discovery application of Publication 4, 
DNA mutations are systematically correlated with transcriptional activity of the 
genes within the affected region, and identification of such regions is a biomcdi- 
cally relevant research task. Prior knowledge of chromosomal distances between 
the observations can improve the detection of the relevant spatial dependencies. 
However, principled approaches to incorporate such prior information in depen- 
dency modeling have been missing. Publication 4 introduces regularized models for 
dependency detection based on classical canonical correlation analysis (Hotelling, 
1936) and its probabilistic formulation (Bach and Jordan, 2005). The models are 
extended by incorporating appropriate prior terms, which are then used to reduce 
the degrees of freedom based on prior biological knowledge. 

Correlation-based variant 

In order to introduce the regularized dependency detection framework of Publica- 
tion 4, let us start by considering regularization of the classical correlation-based 
CCA. This searches for arbitrary linear projection vectors Vj,,Vj, that maximize 
the correlation between the projections of the data sets X, Y. Multiple projection 
components can be obtained iteratively, by finding the most correlating projec- 
tion first, and then consecutively the next ones after removing the dependencies 
explained by the previous CCA components. The procedure will identify maxi- 
mally dependent linear subspaces of the investigated data sets. To regularize the 
solution. Publication 4 couples the projections through a transformation matrix 
T in such a way that Vy = Tv^. With a completely unconstrained T the model 
reduces to the classical unconstrained CCA; suitable constraints on can be used 
to regularize dependency detection. 

To enforce regularization one could for instance prefer solutions for T that are 
close to a given transformation matrix, T ~ M, or impose more general constraints 
on the structure of the transformation matrix that would prefer particular rota- 
tional or other linear relationships. Suitable constraints depend on the particular 
applications; the solutions can be made to prefer particular types of dependency 
in a soft manner by appropriate penalty terms. In Publication 4 the completely 
unconstrained CCA model has been compared with a fully regularized model with 
T = I; this encodes the biological assumption that probes with small chromosomal 
distances tend to capture more similar signal between gene expression and copy 
number measurements than probes with a larger chromosomal distance; the pro- 
jection vectors characterize this relationship, and are therefore expected to have 
similar form, ^ "Vy Utilization of other, more general constraints in related 
data integration tasks provides a promising topic for future studies. 

The correlation-based treatment provides an intuitive and easily implementable 
formulation for regularized dependency detection. However, it lacks an explicit 
model for the shared and data-specific effects, and it is likely that some of the 
dataset-specific effects are captured by the correlation-maximizing projections. 
This is suboptimal for characterizing the shared effects, and motivates the proba- 
bilistic treatment. 
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Probabilistic dependency detection with similarity constrednts 

The probabilistic approach for regularized dependency detection in Publication 4 
is based on an explicit model of the data-generating process formulated in Equa- 
tion (6.1). In this model, the transformation matrices W^;, Wy specify how the 
shared latent variable Z is manifested in each data set X, Y, respectively. In 
the standard model, the relationship between the transformation matrices is not 
constrained, and the algorithm searches for arbitrary linear transformations that 
maximize the likelihood of the observations in Equation (6.2). The probabilistic 
formulation opens up possibilities to guide dependency search through Bayesian 
priors. 

In Publication 4, the standard probabilistic CCA model is extended by incorpo- 
rating additional prior terms that regularize the relationship by reparameterizing 
the transformation matrices as Wy = TW^;, and setting a prior on T. The treat- 
ment is analogous to the correlation-based variant, but now the transformation 
matrices operate on the latent components, rather than the observations. This 
allows to distinguish the shared and dataset-specific effects more explicitly in the 
model. The task is then to learn the optimal parameter matrix W = [Wj,; Wy], 
given the constraint Wy = TW^,. The Bayes' rule gives the model likelihood 

p(X, Y, W, *) ~ p(X, Y|W, *)p(W, *). (6.3) 

The likelihood term p(X, Y| W, *) can be calculated based on the model in Equa- 
tion (6.1). This defines the objective function for standard probabilistic CCA, 
which implicitly assumes a flat prior p(W, '4') ~ 1 for the model parameters. The 
formulation in Equation (6.3) makes the choice of the prior explicit, allowing modi- 
fications on the prior term. To obtain a tractable prior, let us assume that the prior 
factorizes as p(W, *) = p(W)p(*). The first term can be further decomposed as 
p(W) ~ p{'Wx)p{T), assuming independent priors for Wj, and T. A convenient 
and tractable prior for T is provided by the matrix normal distribution:^ 

p(T)=AC(T|M,U,V). (6.4) 

For computational simplicity, let us assume independent rows and columns with 
U = V = (TtI. The mean matrix M can be used to emphasize certain types of 
dependency between W^, and Wy. Assuming uninformative, flat priors p(Wa;) ~ 1 
and f)(*) ~ 1, as in the standard probabilistic CCA model, and denoting S = 
WW-^ -|- the negative log-likelihood of the model is 

II T — M IP 

-logp{X, Y, W, *) - logm + TrU-^ll + " (6.5) 

This is the objective function to minimize. Note that this has the same form as the 
objective function of the standard probabilistic CCA, except the additional penalty 

||r|-i J^ll 2 

term " arising from the prior p(T). This yields the cost function employed 

in Publication 4. In our cancer gene discovery application the choice M I is used 
to encode the biological prior constrain T » I, which states that the observations 
with a small chromosomal distance should on average show similar responses in 
the integrated data sets, i.e., W^, w Wy. The regularization strength can be tuned 

Wm(T|M,U, V) ~ exp (-i7V{U-i(T - M)V-i{T - M)'^}) where M is the mean matrix, 
and U and V denote row and column covariances, respectively. 
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with af.. A fully regularized model is obtained with af. 0. When — >■ oc, 

and Wj; become independent a priori, yielding the ordinary probabilistic CCA. 
The can be used to regularize the solution between these two extremes. Note 
that it is possible to incorporate also other types of prior information concerning 
the dependencies into the model through p(T). 

The model parameters W, ^ are estimated with the EM algorithm. The 
regularized version is not analytically tractable with respect to W in the general 
case, but can be optimized with standard gradient-based optimization techniques. 
Special cases of the model have analytical solutions, which can speed up the model 
fitting procedure. In particular, the fully regularized and unconstrained models, 
obtained with a'^ = and a'^ = oo respectively, have closed-form solutions for 
W. Note that the current formulation assumes that the regularization parameters 
M, (t|. are defined prior to the analysis. Alternatively, these parameters could be 
optimized based on external criteria, such as cancer gene detection performance 
in our application, or learned from the data in a fully Baycsian treatment these 
parameters would be treated as latent variables. Incorporation of additional prior 
information of the data set-specific effects through priors on W^; and ^ provides 
promising lines for further work. 

6.2.1 Cancer gene discovery with dependency detection 

The regularized models provide a principled framework for studying associations 
between transcriptional activity and other regulatory layers of the genome. In 
Publication 4, the models are used to investigate cancer mechanisms. DNA copy 
number changes are a key mechanism for cancer, and integration of copy number 
information with mRNA expression measurements can reveal functional effects of 
the mutations. While causation may be difficult to grasp, study of the dependen- 
cies can help to identify functionally active mutations, and to provide candidate 
biomarkers with potential diagnostic, prognostic and clinical impact in cancer 
studies. 

The modeling task in the cancer gene discovery application of Publication 4 is 
to identify chromosomal regions that show exceptionally high levels of dependency 
between gene copy number and transcriptional levels. The model is used to detect 
dependency within local chromosomal regions that are then compared in order 
to identify the exceptional regions. The dependency is quantified within a given 
region by comparing the strength of shared and data set-specific signal. High 
scores indicate regions where the shared signal is particularly high relative to the 
data-set-specific effects. A sliding- window approach is used to screen the genome 
for dependencies. The regions are defined by the d closest probes around each 
gene. Then the dimensionality of the models stays constant, which allows direct 
comparison of the dependency measures between the regions without additional 
adjustment terms that would be otherwise needed to compensate for differences 
in model complexity. 

Prior information of the dependencies is used to regularize cancer gene detec- 
tion. Chromosomal gains and losses are likely to be positively correlated with the 
expression levels of the affected genes within the same chromosomal region or its 
close proximity; copy number gain is likely to increase the expression of the asso- 
ciated genes whereas deletion will block gene expression. The prior information 
is encoded in the model by setting M = I in the prior term p(T). This accounts 
for the expected positive correlations between gene expression and copy number 
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within the investigated chromosomal region. Regularization based on such prior 

information is shown to improve cancer gene detection performance in Publica- 
tion 4, where the regularized variants outperformed the unconstrained models. 

A genome- wide screen of 51 gastric cancer patients (Myllykangas et al., 2008) 
reveals clear associations between DNA copy number changes and transcriptional 
activity. The Figure 6.2 illustrates dependency detection on chromosome arm 
17q, where the regularized model reveals high dependency between the two data 
sources in a known cancer-associated region. The regularized and unconstrained 
models were compared in terms of receiver-operator characteristics calculated by 
comparing the ordered gene list from the dependency screen to an expert-curated 
list of known genes associated with gastric cancer (Myllykangas et al., 2008). A 
large proportion of the most significant findings in the whole-genome analysis were 
known cancer genes; the remaining findings with no known associations to gastric 
cancer are promising candidates for further study. 

Biomedical interpretation of the model parameters is also straightforward. A 
ML estimate of the latent variable values Z characterizes the strength of the shared 
signal between DNA mutations and transcriptional activity for each patient. This 
allows robust identification of small, potentially unknown patient subgroups with 
shared amplification effects. These would remain potentially undetected when 
comparing patient groups defined based on existing clinical annotations. The pa- 
rameters in W can downweigh signal from poorly performing probes in each data 
set, or probes that measure genes whose transcriptional levels are not functionally 
affected by the copy number change. This provides tools to distinguish between 
so-called driver mutations having functional effects from less active passenger mu- 
tations, which is an important task in cancer studies. On the other hand, the 
model can combine statistical power across the adjacent measurement probes, and 
it captures the strongest shared signal in the two sets of observations. This is 
useful since gene expression and copy number data are typically characterized by 
high levels of biological and measurement variation and small sample size. 

Related approaches 

Integration of chromosomal aberrations and transcriptional activity is an actively 
studied data integration task in functional genomics. The first studies with stan- 
dard statistical tests were carried out by Hyman et al. (2002) and Phillips et al. 
(2001) when simultaneous genome-wide observations of the two data sources had 
become available. The modeling approaches utilized in this context can be roughly 
classified in regression-based, correlation-based and latent variable approaches. 
The regression-based models (Adler et al., 2006; Bicciato et al., 2009; van Wierin- 
gen and van de Wiel, 2009) characterize alterations in gene expression Icivels based 
on copy number observations with multivariate regression or closely related mod- 
els. The correlation-based approaches (Gonzalez et al., 2009; Schafer et al., 2009; 
Soneson et al., 2010) provide symmetric models for dependency detection, based 
on correlation and related statistical models. Many of these methods also reg- 
ularize the solutions, typically based on sparsity constraints and non-negativity 
of the projections (Le Cao et al., 2009; Waaijenborg et al., 2008; Witten et al., 
2009; Parkhomenko et al., 2009). The correlation-based approach in Publication 4 
introduces a complementary approach for regularization that constrains the re- 
lationship between subspaces where the correlations are estimated. The latent 
variable models by Berger et al. (2006); Shen et al. (2009); Vaske et al. (2010), 
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Gene expression (17q) 




Nucleotide position (Mbp) 



Figijre 6.2: Gene expression, copy number signal, and the dependency score along the chromo- 
some arm 17q obtained with the regularized latent variable framework in Equation 6.5. Known 
cancer-associated genes from an expert-curated list are marked with black dots. 



and Publication 4 are based on explicit modeling assumptions concerning the data- 
generating processes. The iCluster algorithm (Shen et al., 2009) is closely related 
to the latent variable model considered in Publication 4. While our model detects 
contimious dependencies, iCluster uses a discrete latent variable to partition the 
samples into distinct subgroups. The iCluster model is regularized by sparsity 
constraints on W, while we tune the relationship between W^; and Wy. More- 
over, the model in Publication 4 utilizes full covariance matrices to model for the 
dataset-specific effects, whereas iCluster uses diagonal covariances. The more de- 
tailed model for dataset-specific effects in our model should help to distinguish 
the shared signal more accurately. Other latcint variable approaches include the 
iterative method based on generalized singular- value decomposition (Berger et al., 
2006), and the probabilistic factor graph model PARADIGM (Vaske et al., 2010), 
which additionally utilizes pathway topology information in the modeling. 

Experimental comparison between the related integrative approaches can be 
problematic since they target related, but different research questions where the 
biological ground truth is often unknown. For instance, some methods utilize pa- 
tient class information in order to detect class-specific alterations (Schafer et al., 
2009), other methods perform de novo class discovery (Shen et al., 2009), provide 
tools for gene prioritization (Salari et al., 2010), or guide the analysis with ad- 
ditional functional information of the genes (Vaske et al., 2010). The algorithms 
introduced in Publication 4 are particularly useful for gene prioritization and class 
discovery purposes, where the target is to identify the most promising cancer gene 
candidates for further validation, or to detect potentially novel cancer subtypes. 
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However, while an increasing number of methods are released as conveniently ac- 
cessible algorithmic tools (Salari et al., 2010; Shen et al., 2009; Schafer et al., 
2009; Witten et al., 2009), implementations of most models are not available for 
comparison purposes. Open source implementations of the dependency detection 
algorithms developed in this thesis have been released to enhance transparency 
and reproducibility of the computational experiments and to encourage further 
use of these models (Huovilainen and Lahti, 2010). 

6.3 Associative clustering 

Functions of human genes are often studied indirectly, by studying model organ- 
isms such as the mouse (Davis, 2004; Joyce and Palsson, 2006). Orthologs are 
genes in different species that originate from a single gene in the last common 
ancestor of these species. Such genes have often retained identical biological roles 
in the present-day organisms, and are likely to share the fimction (Fitch, 1970). 
Mutations in the genomic DNA sequence are a key mechanism in evolution. Con- 
sequently, DNA sequence similarity can provide hypotheses of gene function in 
poorly annotated species. An exceptional level of conservation may highlight crit- 
ical physiological similarities between species, whereas divergence can indicate sig- 
nificant evolutionary changes (Jordan et al., 2005). Investigating evolutionary con- 
servation and divergence will potentially lead to a deeper understanding of what 
makes each species unique. Evolutionary changes primarily target the structure 
and sequence of genomic DNA. However, not all changes will lead to phenotypic 
differences. On the other hand, sequence similarity is not a guarantee of func- 
tional similarity because small changes in DNA can potentially have remarkable 
functional implications. 

Therefore, in addition to investigating structural conservation of the genes at 
the sequence level, another level of investigation is needed to study functional con- 
servation of the genes and their regulation, which is reflected at the transcriptome 
(Jimenez et al., 2002; Jordan et al., 2005). Transcriptional regulation of the genes 
is a key regulatory mechanism that can have remarkable phenotypic consequences 
in highly modular cell-biological systems (Hartwell et al., 1999) even when the 
original function of the regulated genes would remain intact. 

Systematic comparison of transcriptional activity between different species 
would provide a straightforward strategy for investigating conservation of gene 
regulation (Bergmann et al., 2004; Enard et al., 2002; Zhou and Gibson, 2004). 
However, direct comparison of individual genes between species may not be op- 
timal for discovering subtle and complex dependency structures. The associative 
clustciring principle (AC), introduced in Publications 5-6, provides a framework 
for detecting groups of orthologous genes with exceptional levels of conservation 
and divergence in transcriptional activity between two species. While standard 
dependency detection methods for continuous data, such as the generalized sin- 
gular value decomposition (see e.g. Alter et al., 2003) or canonical correlation 
analysis (Hotelling, 1936) detect global linear dependencies between observations, 
AC searches for dependent, local groupings to reveal gene groups with exceptional 
levels of conservation and divergence in transcriptional activity. The model is free 
of particular distributional assumptions about the data, which helps to allocate 
modeling resources to detecting dependent subgroups when variation within each 
group is less relevant for the analysis. The remainder of this section provides 
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Figure 6.3: Principle of associative clustering (AC). AC performs simultaneous clustering of two 
data sets, consisting of paired observations, and seeks to maximize the dependency between the 
two sets of clusters. The clusters are defined by cluster centroids in each data space. The cluster- 
ing results are represented on a contingency table, where clusters of the two data sets correspond 
with the rows and columns of the contingency table, respectively. These are called the mar- 
gin clusters of the contingency table. The table cells are called cross clusters and they contain 
orthologous genes from the two data sets. The cluster centroids are optimized to produce a con- 
tingency table with maximal dependency between the margin cluster counts. Cross clusters that 
show significant deviation from the null hypothesis of independent margins indicate dependency. 
In order to enhance the reliability of the results, the clustering is repeated with slightly differing 
bootstrap samples. Then reliable co-occurrences are identified from a co-occurrence tree with a 
specified threshold. Frequently co-occurring orthologues are selected for further analyzes. 



an overview of the associative clustering principle and its application to studying 
evolutionary divergence between species. 

The associative clustering principle 

The principle of associative clustering (AC) is illustrated in Figure ??. AC per- 
forms simultaneous clustering of two data sets to reveal maximally dependent 
cluster structure between two sets of observations. The clusters are defined in 
each data space by Voronoi parameterization, where the clusters are defined by 
cluster centroids to produce connected, internally homogeneous clusters. Let us 
denote the two sets of clusters by {v}^^}i, {Vj^^}j. A given data point x is then 
assigned to the cluster corresponding to the nearest centroid nij in the feature 
space, with respect to a given distance measure^ d. This divides the space into 
non-overlapping Voronoi regions. The regions define a clustering for all points of 
the data space. The association between the clusters of the two data sets can be 

e V^-^^ if d{x, mi) < (i(x, nn,) for all k. 
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represented on a contingency table, where the rows and columns correspond to 

clusters in the first and second data set, respectively. The clusters in each data set 
are called margin clusters. Each pair of co-occurring observations (xi,yi) maps to 
one margin cluster in each data set, and each contingency table cell corresponds 
to a pair of margin clusters. These are called cross clusters. 

AC searches for a maximally dependent cluster structure by optimizing the 
Voronoi centroids in the two data spaces in such a way that the dependency be- 
tween the contingency table margins is maximized. Let us denote the number 
of samples in cross cluster i,j by n^ . The corresponding margin cluster counts 
are rij. = J2j ''^ij ^^'^ ^-i = J2i ''^ij ■ The observed sample frequencies over the 
contingency table margins and cross-clusters are assumed to follow multinomial 
distribution with latent parameters 9i,9j and Oij, respectively. Assuming the 
model Mi of independent m,argin clusters^ the expected sample frequency in each 
cross cluster is given by the outer product of margin cluster frequencies. The model 
Md of dependent margin clusters deviates from this assumption. The Bayes factor 
(BF) is used to compare the two hypotheses of dependent and independent mar- 
gins. This is a rigorously justified approach for model comparison, which indicates 
whether the observations provide superior evidence for either model. Evidence is 
calculated over all potential values of the model parameters, marginalized over the 
latent frequencies. In a standard setting, the Bayes factor would be used to com- 
pare evidence between the dependent and independent margin cluster models for 
a given clustering solution. AC uses the Bayes factor in a non-standard manner; 
as an objective function to maximize by optimizing the cluster centroids in each 
data space; the centroids define the margin clusters and consequently the margin 
cluster dependencies. 

The centroids are optimized with a conjugate-gradient algorithm after smooth- 
ing the cluster borders with continuous parameterization. The hyperparameters 
jj{d) ^ ^ g^j^j jj(y) arise from Dirichlet priors of the two multinomial models M/, 
Mu of independent and dependent margins, respectively. Setting the hyperpa- 
rameters to unity yields the classical hypergeometric measure of contingency table 
dependency (Fisher, 1934; Yates, 1934). With large sample size, the logarithmic 
Bayes factor approaches mutual information (Sinkkonen et al., 2005). The Bayes 
factor is a desirable choice especially with a limited sample size since a marginaliza- 
tion over the latent variables makes it robust against uncertainty in the parameter 
values, and because finite contingency table counts would give a biased estimate 
of mutual information. The number of clusters in each data space is specified in 
advance, typically based on the desired level of resolution. Nonparametric exten- 
sions, where the number of margin clusters would be inferred automatically from 
the data form one potential topic for further studies; a closely related approach 
was recently proposed in Rogers et al. (2010). 

Publication 6 introduces an additional, bootstrap-based procedure to assess 
the reliability of the findings (Figure ??). The analysis is repeated with similar, 
but not identical training data sets obtained by sampling the original data with 
replacement. The most frequently detected dependencies are then investigated 
more closely. The analysis will emphasize findings that are not sensitive to small 
variations in the observed data. 
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Compeirison methods 

Associative clustering was compared with two alternative methods: standard K- 
means on each of the two data sets, and a combination of K-mcans and informa- 
tion bottleneck (K-IB). K- means (see e.g. Bishop, 2006) is a classical clustering 
algorithm that provides homogeneous, connected clusters based on Voronoi pa- 
rameterization. Homogeneity is desirable for interpretation, since the data points 
within a given cluster can then be conveniently summarized by the cluster cen- 
troid. On the other hand, K-means considers each data set independently, which 
is suboptimal for the dependency modeling task. The two sets of clusters obtained 
by K-means, one for each data space, can then be presented on a contingency 
table as in associative clustering. The second comparison method is K-IB intro- 
duced in Publication 5. K-IB uses K-mcans to partition the two co-occurring, 
continuous-valued data sets into discrete atomic regions where each data point is 
assigned in its own singleton cluster. This gives two sets of atomic clusters that 
are mapped on a large contingency table, filled with frequencies of co-occurring 
data pairs {xk,yk)- The table is then compressed to the desired size by aggre- 
gating the margin clusters with the symmetric IB algorithm in order to maximize 
the dependency between the contingency table margins (Friedman et al., 2001). 
Aggregating the atomic clusters provides a flexible clustering approach, but the 
resulting clusters are not necessarily homogeneous and they are therefore difficult 
to interpret. 

AC compared favorably to the other methods. While AC outperformed the 
standard K-means in dependency modeling, the cluster homogeneity was not sig- 
nificantly reduced in AC. The cross clusters from K-IB (Sinkkoncn ct al., 2003) 
were more dependent than in AC. On the other hand, AC produced more easily in- 
terpretable localized clusters, as measured by the sum of intra-cluster variances in 
Publication 6. Homogeneity makes it possible to summarize clusters conveniently, 
for instance by using the mean expression profiles of the cluster samples, as in 
Figure 6.4B. While K-means searches for maximally homogeneous clusters and K- 
IB searches for maximally dependent clusters, AC finds a successful compromise 
between the goals of dependency and homogeneity. 

6.3.1 Exploratory analysis of transcriptional divergence be- 
tween species 

Associative clustering is used in Publications 5 and 6 to investigate conservation 
and divergence of transcriptional activity of 2818 orthologous human-mouse gene 
pairs across an organism-wide collection of transcriptional profiling data covering 
46 and 45 tissue types in human and mouse, respectively (Su et al., 2002). AC takes 
as input two gene expression matrices with orthologous genes, one for each species, 
and returns a dependency-maximizing clustering for the orthologous gene pairs. 
Interpretation of the results focuses on unexpectedly large or small cross clusters 
revealed by the contingency table analysis of associative clustering. Compared to 
plain correlation-based comparisons between the gene expression profiles, AC can 
reveal additional cluster structure, where genes with similar expression profiles 
are clustered together, and associations between the two species are investigated 
at the level of such detected gene groups. The dependency between each pair of 
margin clusters can be characterized by comparing the respective margin cluster 
centroids that provide a compact summary of the samples within each cluster. 
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Figure 6.4: A The contingency table of associative clustering highlights orthologous gene groups 
in human (rows) and mouse (columns) with exceptional levels of conservation (yellow) or diver- 
gence (blue) in transcriptional activity between the two species. B Average expression profiles 
of a highly conserved group of testis-specific genes across 21 tissues in man and mouse. ©IEEE. 
Reprinted with permission from Publication 6. 



Biological interpretation of the findings, based on enrichment of Gene Ontology 
(GO) categories (Ashburner et al., 2000), revealed genes with strongly conserved 
and potentially diverged transcriptional activity. The most highly enriched cat- 
egories were associated with ribosomal functions, the high conservation of which 
has also been suggested in earlier studies (Jimenez et al., 2002); ribosomal genes 
often require coordinated effort of a large group of genes, and they function in cell 
maintenance tasks that are critical for species survival. An exceptional level of 
conservation was also observed in a group of testis-specific genes, yielding novel 
functional hypotheses for certain poorly annotated genes within the same cross- 
cluster (Figure 6.4). Transcriptional divergence, on the other hand, was detected 
for instance in genes related to embryonic development. 

While general-purpose dependency exploration tools may not be optimal for 
studying the specific issue of transcriptional conservation, such tools can reveal de- 
pendency with minimal prior knowledge about the data. This is useful in functional 
genomics experiments where little prior knowledge is available. In Publications 5 
and 6, associative clustering has been additionally applied in investigating depen- 
dencies between transcriptional activity and transcription factor binding, another 
key regulatory mechanism of the genes. 



6.4 Conclusion 



The models introduced in Publications 4-6 provide general exploratory tools for 
the discovery and analysis of statistical dependencies between co-occurring data 
sources and tools to guide modeling through Bayesian priors. In particular, the 
models consider linear dependencies (Publication 4) and cluster-based dependency 
structures (Publications 5-6) between the data sources. The models are readily 
applicable to data integration tasks in functional genomics. In particular, the mod- 
els have been applied to investigate dependencies between chromosomal mutations 
and transcriptional activity in cancer, and evolutionary divergence of transcript- 
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ional activity between human and mouse. Biomedical studies provide a number 

of other potential applications for such general-purpose methods. An increasing 
number of co-occurring observations across the various regulatory layers of the 
genome are available concerning epigenetic mechanisms, micro-RNAs, polymor- 
phisms and other genomic features (The Cancer Genome Atlas Research Network, 
2008). Simultaneous observations provide a valuable resource for investigating the 
functional properties that emerge from the interactions between the different lay- 
ers of genomic information. An open source implementation in BioConductor'^ 
provides accessible computational tools for related data integration tasks, helping 
to guarantee the utility of the developed models for the computational biology 
community. 



^http: / / www.bioconductor.org/ packages/release/bioc/html/pint .html 
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Chapter 7 

Summary and conclusions 



MathemMics is biology's next microscope, only better; biology is math- 
ematics ' next physics, only better. 

J.E. Cohen (2004) 

Following the initial sequencing of the human genome (International human 
genome sequencing consortium, 2001; Venter et al., 2001), the understanding of 
structural and functional organization of genetic information has extended rapidly 
with the accumulation of research data. This has opened up new challenges and 
opportunities for making fundamental discoveries about living organisms and cre- 
ating a holistic picture about genome organization. The increasing need to or- 
ganize the large volumes of genomic data with minimal human intervention has 
made computation an increasingly central element in modern scientific inquiry. It 
is a paradox of our time that the historical scale of data in public and proprietary 
repositories is only revealing how incomplete our knowledge of the enormous com- 
plexity of living systems is. The particular challenges in data-intensive genomics 
are associated with the complex and poorly characterized nature of living systems, 
as well as with limited availability of observations. It is possible to solve some of 
these challenges by combining statistical power across multiple experiments, and 
utilizing the wealth of background information in public repositories. Exploratory 
data analysis can help to provide research hypotheses and material for more de- 
tailed investigations based on large-scale genomic observations when little prior 
knowledge is available concerning the underlying phenomena; models that are ro- 
bust to uncertainty and able to automatically adapt to the data, can facilitate 
the discovery of novel biological hypotheses. Statistical learning and probabilistic 
models provide a natural theoretical framework for such analysis. 

In this thesis, general-purpose exploratory data analysis methods have been 
developed for organism- wide analysis of the human transcriptome, a central func- 
tional layer of the genome. Integrating evidence across multiple sources of genomic 
information can help to reveal mechanisms that could not be investigated based on 
smaller and more targeted experiments; this is a central aspect in all contributions. 
In particular, methods have been developed (i) in order to improve measurement 
accuracy of high-throughput observations, (ii) in order to model transcriptional 
activation patterns and tissue relatedness in genome-wide interaction networks at 
an organism-wide scale, and (iii) in order to integrate measurements of the human 
transcriptome with other layers of genomic information. These results contribute 
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to some of the 'grand challenges' in the genomic era by developing strategies to 

understand cell-biological systems, genetic contributions to human health and evo- 
lutionary variation (Collins et al., 2003). The computational experiments in this 
thesis have been carried out based on publicly available, anonymized data sets that 
follow commonly accepted ethical standards in biomedical research. Open access 
implementations of the key algorithms have been provided to guarantee wide ac- 
cess to these tools and to spark new research beyond the original applications 
presented in this thesis. 

Methodological extensions and application of the developed algorithms to new 
data integration tasks in functional genomics and in other fields provide a promis- 
ing line for future studies. The methods developed in this thesis arc readily ap- 
plicable in genome- wide screening studies in cancer and potentially other diseases. 
Increasing amounts of co-occurring data concerning various aspects of the genome 
have become available, including gene- and micro- RNA expression, structural vari- 
ation in the DNA, epigcnctic modifications and gene regulatory networks. It is ex- 
pected that with small modifications the introduced methodology can be applied 
to study further associations between these and other layers of genome organiza- 
tion, as well as their contributions to human health. The fundamental research 
challenges in contemporary genome biology provide a wide array of applications 
for statistical learning and exploratory analysis, and a rich source of ideas for 
methodological research. 
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